PREFACE


The origins of this book lie in the author’s experiences, as student, 
research worker and lecturer, over the past 15 years. The intricacies 
and essential characteristics of statistical methods were first intro- 
duced to him as a student by Professor P. R. Crowe, when the latter 
was a Reader in the University of London. The value of such methods 
of analysis has been increasingly appreciated as research, especially 
in the field of climatology, has been pursued during succeeding years. 

For the non-mathematician, however, even the simpler introduc- 
tory books on statistics often raise considerable problems. These are 
accentuated, moreover, by the fact that the methods are applied to 
fields of study which are, in large measure at least, unfamiliar to the 
geographer—industrial or business control, sociology or economic 
theory, the biological sciences or medicine, or simply as a study in 
applied mathematics. Moreover, most geographical studies that have 
employed statistical techniques have equally tended simply to assume 
that the reader would understand the methods despite the normal 
lack of formal statistical training. 

In an attempt to counteract these tendencies, training in statistical 
methods for geography students was expanded at Liverpool Univer- 
sity in 1957. This training aimed at providing a grounding in a variety 
of basic methods, all of which were developed and applied in terms 
of geographical problems. From the course has evolved the present 
book, which it is hoped will provide a similar basic grounding for all 
geographers. Throughout the evolution of this course, and especially 
in encouraging me to expand it in the present form, I have had every 
support from Professor R. W. Steel. It is my former colleague, Dr 
A. T. A. Learmonth, however (now Professor of Geography in the 
School of General Studies, Australian National University, Can- 
berra), to whom the greatest debt is owed, for his unfailing willing- 
ness to discuss and constructively criticize my efforts, for his per- 
sistence in exhorting me to proceed with the work, and for invaluable 
advice and assistance. 

There are many others who, in their various ways, have provided 
help and guidance. Amongst these are Professor S. H. Beaver, who 
read and commented on the text; Dr D. J. Bartholomew, formerly 
lecturer in Statistics at the University of Keele, whose advice at an
early stage helped to set the pattern of this book; Mr P. K. Mitchell, 
the Geography Department, the University College of Sierra Leone, 
a colleague during my year in Sierra Leone (1960-1961), when the 
bulk of this book was written; Miss J. P. Treasure of the Geography 
Department, the University of Liverpool, who drew most of the dia- 
grams; Miss E. M. Shaw, the University of Keele, for help at proof 
stage; and all my colleagues at Liverpool who willingly allowed me 
to try my ideas upon them. 

To them, and to many others, my thanks are due—I trust that they 
approve of the final product. 

S.G.

LIVERPOOL, 1962 




CONTENTS 


PREFACE 
INTRODUCTION 


CHAPTER 1 
The nature of the raw material 


CHAPTER 2 

The calculation and use of the mean 
Types of mean 

Relationships between the means 
Specific examples 

Advantages and disadvantages 


CHAPTER 3 

Deviation and variability 

Types of deviation 

Specific examples 

Alternative methods of calculation 
Variability indices 


CHAPTER 4 

The normal frequency distribution curve and its uses 
Characteristics of the normal curve 

The ‘three standard deviations’ check 

Probability theory 

Probability and the normal frequency distribution 


CHAPTER 5 

Other frequency distribution curves 

Probability and the binomial frequency distribution 
Probability and the Poisson frequency distribution 


CHAPTER 6 

Characteristics of samples 

Sample and population parameters 

Sampling or standard error 

Best estimates, small samples and small populations 


Specification of sample size
Standard error, sample size and the binomial frequency distribution


CHAPTER 7

Methods of sampling
Methods of random sampling
Methods of stratified sampling
Variable sampling fractions
Systematic sampling


CHAPTER 8

The comparison of sample values—I
Statistical significance
Dispersion diagrams
Standard error of the difference
Student’s t test
Specific examples
Comparison of coefficients of variation


CHAPTER 9

The comparison of sample values—II
[The analysis of variance]
The allocation of the variance
Snedecor’s Variance Ratio Test
Tabulated example
Shorter method of assessment
Samples of different sizes


CHAPTER 10

The comparison of frequency distributions
[The Chi-squared Test]
The calculation of Chi-squared
Influence of sample size
The comparison of two or more variables


CHAPTER 11

The problem of correlation
Calculation of the product moment correlation coefficient
Alternative method of calculation 
Further specific examples 

Correlation significance test 
Spearman’s rank correlation coefficient 
Comparison of the two coefficients 


CHAPTER 12 

Regression lines and confidence limits 
Straight-line regression for two variables 
Standard errors and confidence limits 
Further specific example 

Straight-line regression for one variable 
Straight-line regression for spatial change 
The exponential curve 


CHAPTER 13 

Fluctuations and trends 

The simple graph 

Running means 

Cumulative deviations from the mean 
Deviation from a trend line 
Rhythmic fluctuations 


CHAPTER 14 
Scope for the future 


A SHORT SELECTED BIBLIOGRAPHY 
FORMULAE INDEX 


GENERAL INDEX 




ACKNOWLEDGMENT 


We are grateful to the Syndics of the University Press, Cambridge, 
for permission to include adaptations of tables from Cambridge 
Elementary Statistical Tables by D. V. Lindley and J. C. P. Miller. 




INTRODUCTION 


The type of geography which admits the importance of 
quantification and the appropriateness of statistical method- 
ology, but always as servants and not as masters, would 
appear to be the best answer the profession can furnish to the 
embarrassing questions which have arisen during the current 
debate in academic circles regarding geography’s right to
be included in the curricula of institutions of higher learning.


WILLIAM WARNTZ 


In this third quarter of the twentieth century it is increasingly appar- 
ent that the raw material with which the geographer deals is becoming 
progressively more of a quantitative nature and less merely qualita- 
tive. This gradual but steady change in emphasis has of necessity 
engendered a modification of the intellectual approach to the sub- 
ject. As in any other worthwhile field of study, so in geography each 
generation attempts to absorb, and then advance beyond, the accu- 
mulated work of previous generations; this is no more than the 
outward sign of healthy development. These advances may at times 
be in terms of factual knowledge. At other times, however, they 
reflect a changing approach to the subject at large, such as this pre- 
sent conscious and deliberate attempt to provide a more quantitative 
approach to the geographer’s problems. 

In all branches of the subject this tendency is developing. Climato- 
logical investigations have traditionally and necessarily been con- 
cerned with numerical data. Economic geography, too, has for long 
utilized quantitative data as a prime source of information, although 
explanatory studies have tended to rest more heavily on subjective 
judgments than would in many cases seem desirable. Geomorph- 
ology, population studies and various other aspects of human geo- 
graphy, amongst many branches of the subject, have also increasingly 
turned to more precise numerical data over the recent past, all in the 
attempt to render a more accurate and objective assessment of the 
geography of particular areas or problems. Moreover, as geographers 
increasingly co-operate with scientists from other disciplines, or 
engage in the practical fields of planning, the need to present both 
data and conclusions in sound quantitative terms becomes even more
pressing. 

Once such an attitude is accepted, however, a necessary corollary 
follows, that these numerical data should be analysed by sound 
statistical methods so that maximum value is obtained from them. 
Too often a considerable body of valuable quantitative data is pre- 
sented either in a raw state or after a minimal amount of processing. 
Sometimes, of course, this may be quite legitimate as it is all that the 
problem requires. In other cases, however, more fundamental, and 
possibly more valid, conclusions could be reached, or varied aspects 
of a problem investigated, by means of a more comprehensive and 
subtle use of existing statistical methods. Moreover, it is not simply 
that such methods are not used, but that at times false interpretations 
are made either because of the failure to apply such methods or 
because they are misunderstood. The latter may unfortunately arise 
when a geographer quite properly consults a professional statistician 
without at the same time fully understanding the implications of the 
results which are obtained by the methods with which he is provided. 

The aim of this book is therefore to present standard statistical 
techniques in a simple manner and to apply them to problems typical 
of those which geographers consider. In this way a twofold purpose 
is served. On the one hand the requirements of practising geographers 
engaged in research are at least partially met by the presentation of 
methods and techniques, at a relatively simple level, which should 
enable many geographical problems to be analysed more soundly. 
This is not intended to be a comprehensive work covering the full 
field of statistics, but rather a selective presentation of methods, 
which are particularly applicable to geographical problems. For the 
investigation of more complex problems the standard statistical 
texts, of which a selected list is given on p. 228, must be consulted. 
On the other hand, the introduction to relevant elementary statistics 
which this book will provide will enable all students of geography 
more readily to interpret and understand studies based on statistical 
analyses. Many of the misinterpretations which occur at present 
result from the reader’s failure to be conversant with either the ad- 
vantages or the limitations implicit in any writer’s statistical methods 
—this renders difficult the full and accurate assessment of the value 
and implication of what is written. From both viewpoints, therefore 
—from that of the geographer trying to analyse and present his 
material more effectively, and that of the student of geography trying
to interpret and understand existing studies—it is hoped that this 
excursion into statistical methods and their uses to geographers will 
prove of value. 

A fundamental difficulty arises here, however, and it is one which 
is inherent in the whole training which most potential geographers 
receive from their childhood onwards. Most aspiring geographers, 
though by no means all of them, have indeed studied mathematics to 
Ordinary level of the G.C.E. In far too many schools, however, it is 
either administratively impossible, or academically not permissible, 
to study both geography and mathematics together up to Advanced 
level. This lack of sixth-form training, or perhaps the actual training 
received prior to that date, tends to leave most prospective geographers
with a built-in resistance to anything which vaguely suggests
mathematics. Directly (a + b) is written on the blackboard, or
a square root is required, a mental barrier is irrationally erected. 
This quite needless refusal to attempt to tackle such problems tends 
to nullify all attempts to put geography on a sounder footing in its 
handling of quantitative data. 

Throughout this book, therefore, the deliberate design is to lead 
the reader by the hand through these apparently difficult by-ways. 
Save where it is absolutely necessary, there is no attempt to delve 
into the mathematical theories behind the methods, but rather the 
concepts involved are presented in plain English instead of, or as well 
as, in symbols. The computational problems involved should not 
unduly strain the capabilities of any normally intelligent fifteen- 
year-old. What is required, on the other hand, is a conscious willing- 
ness to follow a statistical argument through to its logical conclusion, 
to breach this mental barrier of which I have written and in that way 
to discover an invaluable tool which has been neglected by geo- 
graphers for far too long. 

Thus this book is not designed for statisticians; nor does it claim 
to make statisticians of those who work their way through it. Many 
possible methods, or applications of methods, which could have been 
included have instead been deliberately omitted. Rather, a selection 
of useful methods that can be applied in the field of geography are 
presented, and illustrated in terms of problems which the geographer 
can understand. It is only in this application in terms of geography 
that the author can claim to have made any personal contribution, 
for the methods presented are in common use in so many other disciplines
already, and explained—in greater or lesser complexity and
clarity—in numerous other books (p. 228). It is to this wide range of
statistical texts at a more advanced level that the enthusiast or the
specialist must turn, if the series of simple illustrations in this book 
stimulates him to further enquiry. It is not primarily as an introduc- 
tion to these more advanced statistical studies that this book is de- 
signed, however. If, instead, it succeeds in enabling geographical 
students to handle and interpret quantitative data more effectively, 
then the author will feel that it has more than fulfilled its purpose. 


CHAPTER 1


THE NATURE OF THE RAW MATERIAL 


The methods and techniques used in the analysis of statistical data 
are in large measure controlled by the very character of the statistical 
data themselves. It is therefore necessary to begin with a very brief 
consideration of some of these characteristics so that the varied 
themes that will be introduced later will be more readily understood. 

When any collection of figures, representing some quantitative 
value of any given phenomenon, is to be processed it will be found 
that although such figures all represent the same phenomenon they 
are not all of exactly the same value. Thus if a study were being made 
of the distance inland from the coast that vessels of a given draught 
could sail it would be found that these distances vary markedly 
between one river and another, or between one part of the world and 
another. Again, if the number of vessels sailing along these rivers 
were examined a very wide range in values would be found between 
the different rivers. This highly variable nature of the numerical data 
is common, to a greater or lesser extent, to all sets of data, and this 
quantity which varies (mileage, or numbers of vessels, in the two 
cases given above) is known as the variate. 

A fundamental distinction must be made between two different 
types of variate, however. In the case of the navigable mileage of 
rivers outlined above, it is possible for any mileage value to be 
recorded and for fractions of a mile to be included. In other words, 
it is a continuous variate such that there are no clear-cut or sharp 
breaks between the values that are possible. On the other hand, the 
number of vessels actually sailing these rivers can only be in terms of 
whole numbers and fractions of a vessel cannot be recorded. Such a 
variate is known as discrete and special care must be taken when 
basing conclusions on the analysis of such discrete variates, as will 
be seen later. 

This variable nature of conditions can best be understood and 
appreciated if the data are plotted graphically to show the frequency 
of occurrence of values of different given amounts. The data are first 
grouped into ‘classes’, so that it is known how many occurrences fall 
into each of a series of quantitatively different sets of conditions. 
Then the number of occurrences is plotted against the appropriate
‘class’, and a diagram drawn in the form of ‘building blocks’. Such a
diagram is known as a histogram and the pattern which it presents
is called the frequency distribution for that set of data. From such a
diagram a smoothed curve can be interpolated, this being known as
the ‘frequency curve’ of that set of data. Thus in Fig. 1 can be seen
the frequency distribution for population densities of the European
nation-states. The values for individual states are grouped into
various classes depending on their order of magnitude (e.g. 0-49.9
persons per sq. km.; 50-99.9 persons per sq. km.), and the variable
character of these population densities is readily apparent. The way
in which these population densities vary is shown both by the ‘blocks’
and by the smoothed curve. A similar frequency distribution curve
can be constructed for any and all sets of data. Fig. 2, for example,
shows the distribution of hill summit heights in North Wales based
on summit ring-contours taken from the O.S. 1 : 25,000 maps. As
with the population densities, these summit heights are a continuous
variate. Moreover, both Fig. 1 and Fig. 2 also display another
feature of many distribution curves. It can be clearly seen that these
curves are not symmetrical, having their peak markedly to one side.
Such a distribution is known as skew, and the problems which this
introduces, together with various methods by which these problems
may be largely solved, will be considered later.

Figure 1. Histogram and frequency distribution curve for population densities of
the European nation-states
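The grouping of values into classes described above can be sketched in a few lines of Python. This is an illustration only: the density figures are invented for the purpose, and the helper names `class_label` and `frequency_distribution` are assumptions rather than anything given in the text.

```python
from collections import Counter


def class_label(value, width=50):
    """Return the class containing `value`, e.g. '0-49.9' for 37.2."""
    lower = int(value // width) * width
    upper = lower + width - 0.1
    return f"{lower}-{upper:.1f}"


def frequency_distribution(values, width=50):
    """Count how many values fall into each class of the given width."""
    return Counter(class_label(v, width) for v in values)


# Hypothetical population densities (persons per sq. km.), for illustration only.
densities = [12.5, 48.0, 75.3, 81.9, 95.0, 130.2, 210.7]

for label, count in frequency_distribution(densities).items():
    # One row of 'building blocks' per class, as in a histogram.
    print(f"{label:>10}  {'#' * count}")
```

Each printed row corresponds to one block of the histogram; a class with no occurrences is simply absent from the count.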

It is, in fact, mainly because of the variable character of sets of data,
as well as the fact that the distribution curves which reflect this
variability tend to differ from each other in terms of skewness, that
the whole need for sound statistical analysis of numerical geographical
data arises. If values for any given phenomenon were
always the same most of the analyses would be unnecessary, for
direct comparison of these unvarying values would usually be
adequate. Another reason why careful analysis is required, however,
is that very often it is not possible to obtain data for the whole of
the conditions with which one is concerned. Rather it is a matter of
considering a sample of these conditions, working on the assumption
that this sample provides a fair representation of the whole body of
data (the latter being known as the statistical population). The extent
to which this assumption is justified or not must therefore be allowed
for when comparisons or judgments are being made. The various
methods by which this can be done will also be considered later,
although the need for it must always be borne in mind.

Figure 2. Histogram and frequency distribution curve for hill-summit heights in
North Wales


CHAPTER 2 


THE CALCULATION AND USE OF THE MEAN 


The previous chapter has shown that sets of data are usually com- 
posed of individual values which vary from one another to a greater 
or lesser extent. When, however, it is necessary to express the quanti- 
tative aspect of such a set of data briefly and succinctly, a lengthy 
recital of all the individual values is not of much use. Even the 
graphical representation of these, as illustrated in Chapter 1, is not 
a great help, for it neither allows of a speedy and easy comparison 
between different sets of data nor of a ready expression of these 
characteristics in words or numbers. It is therefore often very useful 
to be able to summarize these varying values within the one set of 
data by one value alone. This one value is chosen so as to give as 
reasonable an approximation as possible to what is ‘normal’. It is 
immediately apparent that however this number is chosen it must 
involve certain generalizations and must also obscure many char- 
acteristics of the set of data that the distribution curve shows. 


Types of Mean 


Such a generalized summary of conditions can be obtained in 
various ways, and a few simple illustrations will show the differences 
between them. If a set of data were simply 


1, 2, 3, 4, 5


and it were necessary to summarize these values, most people would 
carry out such a summary in the following way. They would prob- 
ably add the numbers together, getting a total of 15, and then divide 
this by the total number of items, i.e. by 5. In this way they would 
arrive at an answer of 3. On the other hand it would also be possible 
to arrange the values in order of magnitude—as they are already 
arranged above—and choose the middle one as being representative 
of them all. Again, the answer would be 3. 

With a rather more complex set of data the same approach could 
be adopted. For example, the data may be as follows:

1, 2, 2, 3, 3, 3, 3, 4, 4, 5

In this case the total of these several values is 30 and if this is divided
by the number of values involved, i.e. 10, the answer is again 3. 
Also, if the method of choosing the central value when they are 
arranged in order of magnitude is used, then 3 is again the answer. 
In this case, another method also presents itself, for it is possible to 
proceed as for the preparation of a distribution curve and group the 
data into sets or classes in the following way. 


Value    Number of occurrences
1        1
2        2
3        4
4        2
5        1


Having thus grouped the data according to the number of occur- 
rences of any one value it is possible to choose that value which occurs 
most frequently. Here it is once again the value 3. 

It is these three methods, presented here in all their simplicity, that 
are the three basic ways of summarizing a set of data. Each of these 
methods gives a value which to some extent represents the set of data.
This value is sometimes referred to as the ‘normal’ or ‘norm’, but the 
more usual term for it is the ‘mean value’ or, more simply, the ‘mean’. 

The first method applied above is the arithmetic average or the 
arithmetic mean, usually more loosely referred to as either the mean 
or the average. As was seen when the method was first used, the 
average is simply obtained by adding the values together and then 
dividing by the number of values that there are. It can be defined 
more precisely—and apparently more technically—by stating that 
‘the average is a quotient obtained by dividing the total by the number 
of occurrences connected with it’. 

To express this relationship in mathematical terms is quite simple 
once the ‘shorthand’ used is memorized. Thus the above statement 
can be written as: 
$$\bar{x} = \frac{\sum x}{n}$$

In this statement

$x$ = the individual values making up the series of data
$\bar{x}$ = the average of that series
$n$ = the number of occurrences being considered
$\sum$ = the summation of all the values of $x$, i.e. the total when all
values of $x$ are added together.


Thus in a few signs the rather long definition of the average given 
above can be summarized easily. 

The second mean value which was obtained in the earlier examples 
is called the median. This was obtained by placing the values in 
ascending or descending order of magnitude and then finding the 
central value of these. If the total number of occurrences were to be 
an odd number then the median would be one of the observed 
values. If, on the other hand, there were an even number of occur- 
rences, then the median would lie midway between two of these 
values. These differing sets of conditions are to be seen in the two 
simple examples with which this consideration was begun. Once 
again, however, it is as well to define terms as precisely and un- 
ambiguously as possible and to state that ‘the median is the reading 
on the scale of the variable such that there are an equal number of 
entries above it and below it’. 

The third mean which was considered, in which the data were 
grouped into classes and the class containing the most occurrences 
was chosen, is referred to as the mode, i.e. the most fashionable; the 
one which occurs most often. This again can be presented in terms of 
a formal definition, such as that ‘the mode is the value of the class 
within a statistical group in which there are most incidences’. 
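The three definitions can be turned into a minimal Python sketch. The function names are illustrative assumptions, and the second series is assumed to be one that matches the figures quoted earlier (ten values totalling 30, with 3 as both the central and the most frequent value).

```python
from collections import Counter


def average(values):
    """Total of the values divided by the number of occurrences."""
    return sum(values) / len(values)


def median(values):
    """Central value once the values are placed in order of magnitude."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of occurrences: the median is an observed value.
        return ordered[mid]
    # Even number: the median lies midway between the two central values.
    return (ordered[mid - 1] + ordered[mid]) / 2


def mode(values):
    """Value that occurs most often (ties resolve to the first one met)."""
    return Counter(values).most_common(1)[0][0]


first = [1, 2, 3, 4, 5]
second = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5]  # assumed series, see lead-in

print(average(first), median(first))
print(average(second), median(second), mode(second))
```

The mode is not printed for the first series because every value there occurs once, so no single value is the most frequent.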


Relationships between the Means 


These three—the average, the median and the mode—are the 
main methods of expressing the mean value of any set of data. It 
would therefore seem desirable to consider briefly the relationship 
between them and to try to assess which, if any, is preferable to the 
others, and why this may be so. This can first be done by considering 
one of the earlier examples in a slightly different way. If the series 


1, 2, 2, 3, 3, 3, 3, 4, 4, 5
were to be plotted as a histogram it would appear as in Fig. 3, where 
a smooth frequency distribution curve is also drawn to fit these data. 
In this particular example, as has been seen above, the three mean 
values all coincide at the same point, i.e. 3. Also the accompanying 
frequency distribution curve is perfectly evenly balanced on either
side of these mean values. Such a symmetrical distribution curve is 
referred to as a normal distribution curve; equally, with a normal 
distribution this perfect coincidence of the three mean values always 
occurs. The existence of such a normal distribution is assumed in 
most statistical methods, although in practice it is seldom perfectly 
achieved, as was indicated in Chapter 1. 

Figure 3. The normal distribution curve

Figure 4. A positively skew distribution curve (average 2.7)

The following set of data illustrates a non-normal distribution: 
1, 1, 2, 2, 2, 3, 3, 4, 4, 5

The three means can be calculated as below:

(a) the average = $\bar{x} = \frac{\sum x}{n} = \frac{27}{10} = 2.7$

For the median and the mode the values can be retabulated:

values:        1, 1    2, 2, 2    3, 3    4, 4    5
occurrences:    2         3        2       2      1

(b) median = 2.5
(c) mode = 2


Thus in this case the three means are different from each other, the 
average being the largest value and the mode the smallest. In Fig. 4 
these values are plotted on a histogram and the distribution curve is 
added. Clearly this distribution curve is NOT evenly balanced, having 
its ‘peak’ to the left of centre and a ‘tail’ to the right. This lack of 
balance is called skewness, while when the ‘tail’ extends to the right, 
as in this case, it is classed as positive skewness. 
On this diagram are also entered the three means, the mode lying
to the left, the median in the centre and the average to the right. This 
relative pattern of the three means is true of all positively skew 
distributions. Moreover, provided that the skewness is not too 
marked, a general quantitative relationship also tends to exist 
between the means. 

This relationship, which gives only a general approximation, can 
be expressed as follows: 


$$\text{MODE} = \text{AVERAGE} - 3(\text{AVERAGE} - \text{MEDIAN})$$

i.e.

$$\text{MODE} = 2.7 - 3(2.7 - 2.5) = 2.7 - 3(0.2) = 2.7 - 0.6 = 2.1$$

In fact the modal value is 2.0. Expressed in other terms, it means
that the median lies one-third of the way back from the average 
towards the mode (Fig. 5). 
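The approximate relationship can be checked numerically. The sketch below assumes a series consistent with the three means quoted above (2.7, 2.5 and 2); `estimate_mode` is an illustrative name, and the rule itself is only a rough guide for moderately skew data.

```python
import statistics


def estimate_mode(average, median):
    """Approximate the mode from the average and median (moderate skew only)."""
    return average - 3 * (average - median)


# A series consistent with the averages quoted in the text (assumed values).
series = [1, 1, 2, 2, 2, 3, 3, 4, 4, 5]

average = statistics.mean(series)        # 27 / 10
median = statistics.median(series)       # midway between the 5th and 6th values
observed_mode = statistics.mode(series)  # the most frequent value

print(round(average, 1), median, observed_mode)
print(round(estimate_mode(average, median), 1))
```

The estimate differs slightly from the observed mode, which is what the text means by a general approximation.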

Figure 5. Relations between the means in a skew distribution

Figure 6. A negatively skew distribution curve


It is equally possible for distributions to be negatively skew, i.e. 
for the ‘tail’ to lie to the left of the curve and for the peak to lie to the 
right. This is exemplified in the following case (Fig. 6):

values:        1    2, 2    3, 3    4, 4, 4    5, 5
occurrences:   1     2       2         3        2

average = $\bar{x} = \frac{\sum x}{n} = \frac{33}{10} = 3.3$    median = 3.5    mode = 4.0

The general relationship of the mode, the median and the average
still holds true but in the reverse direction,
i.e. $\text{MODE} = \text{AVERAGE} + 3(\text{MEDIAN} - \text{AVERAGE})$.
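The relationship for negative skewness can be worked through with the figures just given. The following is a sketch only, writing $\bar{x}$ for the average; its last line shows that the expression is algebraically the same as the one used for positive skewness.

```latex
\begin{align*}
\text{MODE} &\approx \bar{x} + 3(\text{MEDIAN} - \bar{x}) \\
            &= 3.3 + 3(3.5 - 3.3) \\
            &= 3.3 + 0.6 = 3.9, \\
\bar{x} + 3(\text{MEDIAN} - \bar{x}) &= \bar{x} - 3(\bar{x} - \text{MEDIAN}).
\end{align*}
```

The estimate of 3.9 is close to the observed mode of 4.0.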


Specific Examples


Having outlined these methods in abstract terms they should now 
be considered in relation to specific data of geographical interest. 
In Table I are set out the annual totals of rainfall at Bidston Observatory,
Birkenhead, for the thirty years 1901-1930. These vary
between 22.47” and 36.50” and these data represent a continuous
variate. On calculating the average it is found that

$$\bar{x} = \frac{\sum x}{n} = \frac{853.63}{30} = 28.45$$

In the second column of Table I these values are retabulated into a
descending order of magnitude. As there are thirty values, the median
will lie between the fifteenth and sixteenth values, i.e. midway
between 28.08” and 28.45”, giving a value of 28.27”. In the third and
fourth columns of Table I, the values are grouped into several
classes and the number of occurrences in each class is shown. As
this is a continuous variate, all possible numerical values must be
allowed for, not simply whole numbers. The class limits must therefore
be designed to provide a fully continuous range of values, i.e. not
21 & 22; 23 & 24 etc. but 21 to 22.99; 23 to 24.99 etc. In this way
a modal class of 25” to 26.99” is found to occur. These conditions are
represented graphically in Fig. 7 where slight positive skewness is
shown.

Figure 7. Histogram and frequency distribution curve for annual rainfall at
Bidston, 1901-1930. [Horizontal axis: graded classes of annual rainfall (in
inches), 19-20.99 to 37-38.99; vertical axis: number of occurrences in each
graded class. Marked on the figure: modal class 25-26.99; median 28.27;
average 28.45.]


Table I

Annual rainfall at Bidston Observatory, Birkenhead, 1901-1930

Values in order     Values in order     Classes      No. of
of occurrence       of magnitude        (inches)     occurrences
x (inches)          (inches)

[illegible]         36.50               21-22.99      1
25.57               34.81               23-24.99      2
34.42               34.42               25-26.99     10
25.18               33.34               27-28.99      6
24.01               32.87               29-30.99      5
28.08               31.93               31-32.99      2
26.57               30.92               33-34.99      3
28.90               30.59               35-36.99      1
28.45               30.17
28.59               29.12
25.27               29.11
30.17               28.95
25.78               28.90
26.02               28.59
26.83               28.45
24.87               28.08
30.59               28.00
31.93               26.83
29.12               26.57
33.34               26.02
22.47               25.97
25.97               25.78
30.92               25.57
32.87               25.27
28.00               [illegible]
28.95               25.18
34.81               [illegible]
29.11               24.87
[illegible]         24.01
36.50               22.47

Total 853.63
Average 853.63 / 30 = 28.45     Median 28.27     Mode = 25-26.99
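The arithmetic for Bidston can be retraced from the figures quoted in the text alone. The sketch below is illustrative: `rainfall_class` is a hypothetical helper showing how continuous class limits of two inches, starting at 21, assign each value to exactly one class.

```python
from decimal import Decimal, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")


def rainfall_class(value, start=21, width=2):
    """Return the continuous class (e.g. '25-26.99') containing `value`."""
    index = int((value - start) // width)
    lower = start + index * width
    return f"{lower}-{lower + width - 1}.99"


# Figures quoted in the text for Bidston, 1901-1930.
total = Decimal("853.63")
years = 30
average = (total / years).quantize(TWO_PLACES, rounding=ROUND_HALF_UP)

# The median lies midway between the fifteenth and sixteenth ordered values.
median = ((Decimal("28.45") + Decimal("28.08")) / 2).quantize(
    TWO_PLACES, rounding=ROUND_HALF_UP
)

print(average, median)
print(rainfall_class(25.57), rainfall_class(22.47), rainfall_class(36.50))
```

Decimal arithmetic is used here so that the rounding to two places behaves as it would when worked by hand.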

Differences in frequency distributions and in the relationship
between the three means can also be appreciated by considering data
related to economic geography. In Table II are set out the annual
iron-ore production figures for the twenty years 1938-1957 for four
western European countries—Belgium, France, Luxembourg and the
United Kingdom. The average and median values can be readily
obtained, in the same way as for the Bidston rainfall data, and these
are given at the foot of each column. In every case the median is
higher than the average, suggesting a tendency for negative skewness,
though in the case of France this does not seem to be borne out by
the frequency distribution (Fig. 8). Fig. 8 also displays the difficulty
of establishing a clear-cut mode in many sets of data.

Figure 8. Histograms of annual iron-ore production for Belgium, France,
Luxembourg and the United Kingdom, 1938-1957
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Table IT 
Annual iron-ore production 1938-1957 (in thousands of tons)

            Belgium      France    Luxembourg        U.K.
               65        10,203       1,506         3,615
               60        10,161       1,639         4,417
               29         4,113       1,368         5,449
               47         3,467       1,912         5,528
               41         4,144       1,431         5,449
               46         5,350       1,471         5,411
               16         2,862         816         4,390
               11         2,349         394         4,162
               14         5,021         650         3,574
               21         6,099         592         2,974
               34         7,555       1,020         3,990
               15        10,200       1,241         4,086
               16         9,750       1,154         3,812
               28        11,440       1,688         4,504
               47        13,230       2,174         4,618
               35        13,790       2,151         4,500
               29        14,240       1,766         4,369
               37        16,340       1,933         4,437
               50        17,120       2,034         4,457
               48        18,770       2,036         4,637
Average        34.45      9,310.2     1,448.8       4,418.95
Median         34.5       9,955.5     1,488.5       4,427.0
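
As a check on the foot of the table, the following sketch recomputes the average and median for two of the columns. The function name `average_and_median` is an assumption made for illustration; the figures are those of Table II.

```python
from statistics import mean, median

# Annual iron-ore production, 1938-1957, in thousands of tons (Table II).
BELGIUM = [65, 60, 29, 47, 41, 46, 16, 11, 14, 21,
           34, 15, 16, 28, 47, 35, 29, 37, 50, 48]
UNITED_KINGDOM = [3615, 4417, 5449, 5528, 5449, 5411, 4390, 4162, 3574, 2974,
                  3990, 4086, 3812, 4504, 4618, 4500, 4369, 4437, 4457, 4637]

def average_and_median(values):
    """Return (average, median) for one column of the table."""
    return mean(values), median(values)

for name, column in [("Belgium", BELGIUM), ("U.K.", UNITED_KINGDOM)]:
    avg, med = average_and_median(column)
    # A median above the average hints at negative skewness.
    relation = "median above average" if med > avg else "median not above average"
    print(name, avg, med, relation)
```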


Advantages and Disadvantages 


The advantages and disadvantages which the three types of mean 
possess as working tools can now be more generally considered, 
following these illustrations from actual conditions. The mode, by 
its very definition, indicates that which is most common or frequent. 
Very often, however, there is some difficulty in deciding exactly 
where the mode occurs. This difficulty can arise for one of two 
reasons. First, the distribution may not be unimodal, i.e. it may well 
have two or more modal groups of roughly equal importance. Thus, 
when considering the iron-ore data (Table II) it was seen that it 
was difficult to establish a clear mode, especially for Belgium and 
Luxembourg. In each of these cases two classes of equal frequency 
exist in all suitable broad groupings of the data (Fig. 8). The second
difficulty arises from the selection of the classes which are to be
adopted.

If the Bidston rainfall data, for example, were to be grouped in
terms of classes beginning 22"-23.99", instead of 21"-22.99" as in
Table I and Fig. 7, the following frequencies would be found:

Classes        No. of
(inches)       occurrences

22-23.99            1
24-25.99            9
26-27.99            3
28-29.99            8
30-31.99            4
32-33.99            2
34-35.99            2
36-37.99            1

The modal class would thus become 24"-25.99" (instead of 25"-26.99"),
while the frequency distribution would appear to be virtually
duomodal. These difficulties mean that in practice the mode is a very
imprecise form of the mean value; it may be difficult to locate and
the actual value arrived at may in part result from a subjective choice
of groupings. Furthermore, the mode does not possess any true
mathematical qualities, having at best only a generalized relationship
to the average (p. 9), so that it cannot be used in formulae to derive
further characteristics of the set of data. Save for graphical purposes
(and also for its use in certain generalized computations to be
outlined later) the mode is not a method to be highly recommended.
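
The dependence of the mode on the choice of class limits can be illustrated with a small sketch. The helper names `class_counts` and `modal_classes` are hypothetical, and classes are again labelled by their lower limits; only the starting point of the first class is changed between the two runs.

```python
from collections import Counter

# Annual rainfall at Bidston, 1901-1930 (inches), from Table I.
RAINFALL = [
    25.19, 25.57, 34.42, 25.18, 24.01, 28.08, 26.57, 28.90, 28.45, 28.59,
    25.27, 30.17, 25.78, 26.02, 26.83, 24.87, 30.59, 31.93, 29.12, 33.34,
    22.47, 25.97, 30.92, 32.87, 28.00, 28.95, 34.81, 29.11, 25.15, 36.50,
]

def class_counts(values, origin, width=2):
    """Group values into classes of equal width starting at `origin`."""
    counts = Counter(origin + width * int((v - origin) // width) for v in values)
    return dict(sorted(counts.items()))

def modal_classes(counts):
    """Return the lower limits of every class sharing the highest frequency."""
    top = max(counts.values())
    return [lower for lower, n in counts.items() if n == top]

from_21 = class_counts(RAINFALL, origin=21)
from_22 = class_counts(RAINFALL, origin=22)
print("classes from 21:", from_21, "modal:", modal_classes(from_21))
print("classes from 22:", from_22, "modal:", modal_classes(from_22))
```

Shifting the class limits by a single inch moves the modal class and brings a second class to within one occurrence of it, which is the subjectivity the text warns about.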

When the median is used as the mean value, it can be considered
as representing the ‘mean expectation’, in that there are as many
individual occurrences above it as there are below it. Moreover, in
the calculation of the median every occurrence is given the same unit
weight, i.e. it is regarded as of equal importance, whether it be of
small, medium or large magnitude. Indeed, magnitude of individual
values is of no importance directly, except for that of the central
value when there is an odd number of occurrences being considered,
or of the central two values when there is an even number of
occurrences. This means that widely differing sets of data can return
the same median value, as is indicated in a generalized way in Fig. 9.
Furthermore, this implies that the median possesses no real
mathematical qualities and cannot be used for further computation
except in the most general manner. In this way it suffers from the same
limitation as does the mode. Nevertheless, the median does possess
the valuable property of clear definition, in that its relative position
within the occurrences is undisputed and readily understood, while it
is also very useful in illustrative material.

[Figure 9: horizontal axis, increasing values.]

Figure 9. Relationship of the median to two sets of data

Of the three types of mean which have been considered, it is only the
average which is based on sound mathematics, and which therefore
possesses properties which permit its use in further calculations.
Nevertheless, it is essential that the implications and limitations of
the average also be appreciated. In the calculation of the average,
weight is given to each occurrence according to its magnitude, in
that all occurrences and their order of magnitude are used in its
computation. Thus the extreme values are excessively stressed in
comparison with the middling values. In a distribution which
approximates to the normal (Fig. 3) this is of minor importance at
the most, and in each of the three idealized distributions considered
earlier (pp. 8-9), namely normal, slight positive skewness and
slight negative skewness, half of the occurrences exceeded the average
and half were less than it. In cases of marked skewness, however,
this will not be so. The following values may represent the annual
falls of rain in inches in a desert area over ten years, and the
resulting distribution curve is seen in Fig. 10.

Fall (in.) = 0, 1, 0, 0, 10, 2, 25, 0, 0, 2;
total = 40 in.; average = 4.0 in.

[Figure 10: horizontal axis, rainfall groups (in inches); vertical axis,
occurrences. Mode 0; median 0.5; average 4.0.]

Figure 10. Frequency distribution curve and mean values for annual
rainfall of a desert area


Thus it can be seen that the average rainfall of 4" was exceeded only
twice in the ten years, while in the other eight years the rainfall was
below the average. This is because the two wet years when the falls
were 25" and 10" have each made a greater contribution to the total
rainfall and have therefore affected the average value far more than
have each of the more common years when rainfall was nil. In this
particular example the median would be 0.5" and the mode 0.0", both
of which give a better direct indication of the conditions which are
most typical. Nevertheless, this does not imply that the average is
useless or needlessly misleading in such cases, for any
misinterpretation of the 4" average value is a result of a failure by
either the writer or the reader to appreciate what the average value
really is and how it is calculated. On the other hand, this
characteristic does illustrate one of the limitations of the average
in relation to skew distributions, and also the possibly misleading
character of any mean value when it is used alone. For a proper
appreciation of the significance and relevance of any mean value it
is also necessary to know something more of the distribution which
the mean summarizes, e.g. it is desirable to know how actual
conditions are ‘scattered’ around the mean value. It is the various
methods by which this can be done, and their implications, to which
attention must be paid in the next chapter.
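
The desert example can be reproduced in a few lines. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen here for illustration only.

```python
from statistics import mean, median, mode

# Annual rainfall (inches) in a desert area over ten years, as given in the text.
FALLS = [0, 1, 0, 0, 10, 2, 25, 0, 0, 2]

average = mean(FALLS)
middle = median(FALLS)
commonest = mode(FALLS)

# Count the years lying above and below the average.
years_above = sum(1 for fall in FALLS if fall > average)
years_below = sum(1 for fall in FALLS if fall < average)

print("average", average, "median", middle, "mode", commonest)
print("years above average:", years_above, "years below:", years_below)
```

The two wet years alone account for 35 of the 40 inches, which is why the average sits well above eight of the ten individual values.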
CHAPTER 3


DEVIATION AND VARIABILITY 


The fact that in any set of data the actual values differ from one 
another, and also from the mean value itself, has been stressed several 
times in the foregoing pages. A necessary corollary is that for a true 
and worthwhile understanding of the mean value of a set of data it 
must be possible to associate that mean easily and readily with some 
measure of the degree of scatter about that mean. If this can be 
achieved then the utility of the mean value is greatly increased and 
many further deductions can be made concerning other properties of 
the set of data under consideration. Such applications in the field of 
geography will be presented in succeeding chapters. 


Types of Deviation 


If data are being presented graphically the simplest and most effec- 
tive indication of scatter is provided by the frequency distribution 
curve, while a dispersion diagram in which each value is indicated is 
also useful. If scatter is to be expressed in numerical terms, however, 
these will not be applicable. One rough-and-ready way in which 
scatter can be expressed is in terms of the highest and lowest values 
occurring in the record. For example, the following two sets of 
figures both have the same average value, i.e. 5. 


set i      1, 3, 5, 7, 9      average = 5

set ii     3, 4, 5, 6, 7      average = 5

Clearly the scatter about this average is different in the two cases and
the ranges of values involved will give a very generalized idea of this.
Thus it could be said that ‘set i’ has an average of 5 and a scatter
from 1 to 9, while ‘set ii’ has an average of 5 and a scatter from 3 to 7.
Although this is helpful in its own way, it is very imprecise.
Moreover, it does not provide a summary of the scatter in one value only,
which is desirable if statistical analysis is to be effective. It is
therefore to methods which satisfy these conditions that attention must
now be given.

Three more accurate and useful methods of summarizing scatter
are commonly employed, these methods yielding numerical values
which are referred to as ‘deviation’ values, i.e. values representing
differences from the mean. Two of these methods can be used with
the average, while one can be used with the median. If the mode is
the type of mean value being employed, no effective deviation value
can be presented, which is another disadvantage in the use of this
value.

If conditions are being presented by the median then scatter can
be summarized by the quartile deviation. This is derived just as simply
as is the median itself, and it equally possesses the same advantages
and disadvantages as the median. The median was obtained (p. 7)
by dividing the record into two equal parts as regards number of
occurrences, this being effected by an inspection of either the figures
themselves or a graphical plot of those figures. The two halves of the
record, above and below the median, can then each be divided into
two so that the overall record is divided into four groups of equal
number of occurrences. The new dividing lines are called the Upper
and Lower Quartiles, the former separating the 25% of the record
with the highest values from the rest, and the latter similarly
separating the 25% with the lowest values. In Fig. 11 the values for
rainfall at Bidston Observatory, which were set out in Table I, are
presented graphically in order of magnitude. On this graph are entered
the median at 28.27" as was obtained on p. 10, and also these two
quartile values. Thus above the median there are 15 values of which
the central one is the eighth from the top, so that the upper quartile
is 30.59". Again, there are 15 values below the median, the central
one of these being the eighth from the bottom, i.e. the lower quartile
is 25.57".

Figure 11. Graphical calculation of the median and quartiles for annual
rainfall at Bidston, 1901-1930

These new quartile dividing lines enclose within them the central
50% of the occurrences. The difference between the top and the
bottom of this central 50% is called the inter-quartile range which for
the example in Fig. 11 is 30.59" - 25.57" = 5.02". This range lies
athwart the median, and if the distribution curve were normal, i.e.
symmetrically balanced, then each of the quartiles would lie half this
distance away from the median, i.e. 2.51" away in the above example.
It is this value, which gives an indication of the range of the central
50% of the occurrences above and below the median, that is called
the quartile deviation. It may thus be expressed as

$$\text{quartile deviation} = \frac{\text{upper quartile} - \text{lower quartile}}{2}$$

and can be described as the mean expectation of the deviation from
the mean. In other words, half the occurrences differ from the mean
(median) by more than this amount and half differ from it by less
than this amount.
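
The quartile calculation can be sketched as follows. The sketch assumes the convention used in the text, in which each quartile is the central value of the half of the record lying above or below the median; other quartile conventions (for example that of `statistics.quantiles`) can give slightly different figures. The function name `quartile_summary` is hypothetical.

```python
from statistics import median

# Annual rainfall at Bidston, 1901-1930 (inches), from Table I.
RAINFALL = [
    25.19, 25.57, 34.42, 25.18, 24.01, 28.08, 26.57, 28.90, 28.45, 28.59,
    25.27, 30.17, 25.78, 26.02, 26.83, 24.87, 30.59, 31.93, 29.12, 33.34,
    22.47, 25.97, 30.92, 32.87, 28.00, 28.95, 34.81, 29.11, 25.15, 36.50,
]

def quartile_summary(values):
    """Return (lower quartile, median, upper quartile, quartile deviation)."""
    ordered = sorted(values)
    half = len(ordered) // 2            # the middle value is excluded when n is odd
    lower_q = median(ordered[:half])    # central value of the lower half
    upper_q = median(ordered[-half:])   # central value of the upper half
    return lower_q, median(ordered), upper_q, (upper_q - lower_q) / 2

lower_q, middle, upper_q, quartile_deviation = quartile_summary(RAINFALL)
print(f"lower quartile {lower_q}, upper quartile {upper_q}")
print(f"inter-quartile range {upper_q - lower_q:.2f}, "
      f"quartile deviation {quartile_deviation:.2f}")
```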

This is a very useful method, providing an easily-obtained value 
which possesses some clear meaning. On the other hand, it is still 
not really a true measure of overall scatter of the occurrences about 
the mean, for as in the case of the median the order of magnitude 
of the occurrences other than those specifically associated with the 
critical values, i.e. the quartiles, is not considered at all. It is only the 
existence of a given number of occurrences between or beyond certain 
points that is taken into account, not their order of magnitude. This 
characteristic once again renders the median/quartile system of only 
limited use for further computations. Nevertheless it is a valuable 
illustrative device which is widely used especially in the presentation 
of climatic data. The 10 mile to 1 inch rainfall map of the British 
Isles, prepared by the Meteorological Office for the Ministry of 
Town and Country Planning, provides an excellent example. 

It was stressed in the previous chapter, however, that of the three
means it is only the arithmetic average which is mathematically
sound, and measurements of scatter in relation to it therefore need
consideration. This can first be done in terms of a simple example.
In the short set of data given below the average of the six values is
3.5.

$$6 + 5 + 4 + 3 + 2 + 1 = 21 \qquad \text{Average } (\bar{x}) = \frac{\sum x}{n} = \frac{21}{6} = 3.5$$


A simple way of assessing the scatter of these values about this
average is first to find out by how much each occurrence differs from
(i.e. deviates from) this average value. These individual differences
or deviations can be tabulated alongside the values themselves, as is
done below.

Values (x)    Deviations (d)
    6              2.5
    5              1.5
    4              0.5
    3              0.5
    2              1.5
    1              2.5
 6)21           6)9.0
  = 3.5          = 1.5

Once this has been set out it is a simple proposition to
calculate the average amount by which individual values deviate
from the mean. In other words, this gives the mean (average)
deviation from the mean (average), and is known as the Mean
Deviation. It is apparent, however, that no consideration has been
given to the direction of these individual deviations, whether they
be above or below the average. Instead, this question of direction or
‘sign’ (+ or -) has simply been ignored despite the fact that the
sign is part of the mathematical quality of the deviations. This fact
is recognized in the stricter definition of the mean deviation which
can be presented as ‘the average difference between various
measurements and the central mean value, irrespective of sign’. This is
written as follows:

$$\text{Mean deviation} = \frac{\sum |x - \bar{x}|}{n}$$

Here the fine vertical lines indicate that for this purpose it does not
matter whether the value $x$ is greater or less than the average
$\bar{x}$, i.e. it is the difference between them irrespective of sign
which is summed and averaged.

This ignoring of the sign is very convenient, making for simple
calculation and easy understanding of the meaning of the resultant
value. Nevertheless it is improper mathematically, for the sign is
necessarily an integral part of the value and if it is to be removed
or standardized this should be effected by mathematical means.
Therefore, before illustrating the mean deviation in terms of an
actual set of data, it is desirable to outline the method by which the
sign may be properly dealt with, so that the two methods can then
be compared.
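
The mean deviation as defined above can be sketched in a few lines. The function name `mean_deviation` is an assumption made for illustration; the data are the six-value example and the two five-value sets introduced earlier in the chapter.

```python
def mean_deviation(values):
    """Average absolute difference between each value and the arithmetic average."""
    average = sum(values) / len(values)
    return sum(abs(v - average) for v in values) / len(values)

SIMPLE = [6, 5, 4, 3, 2, 1]
SET_I = [1, 3, 5, 7, 9]
SET_II = [3, 4, 5, 6, 7]

print("simple example:", mean_deviation(SIMPLE))
print("set i:", mean_deviation(SET_I), "set ii:", mean_deviation(SET_II))
```

Sets i and ii share the same average of 5, but the mean deviation of set i is twice that of set ii, which summarizes the difference in scatter in a single value.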

If the above example is reconsidered, a different way of removing
the sign can be seen. This is by means of squaring all of the
differences, when the sign thus becomes positive in all cases. This is
shown in the example under the column headed $d^2$.

   x          d          d^2
   6        +2.5        6.25
   5        +1.5        2.25
   4        +0.5        0.25
   3        -0.5        0.25
   2        -1.5        2.25
   1        -2.5        6.25
 mean = 3.5           6)17.50
                        2.917

$$\sqrt{2.917} = \pm 1.7$$

Then, as with the mean deviation, these several individual deviations
are summed and averaged. This value, the average of the squares of the
deviations from the average, is known as the Variance of the set of
data, a parameter (or characteristic) to which reference will
frequently be made in later sections. It represents the average amount
of deviation from the mean, the negative signs having been changed to
the positive by mathematical methods. A deviation value, however,
purports to summarize differences from the mean in both a positive
and a negative direction. This feature can be introduced by finding
the square root of this variance, for the square root of any number
has both a positive and a negative value, i.e. $\sqrt{4} = +2$ or $-2$.
So in the above example, while the Variance is 2.917 the deviation
value is $\pm 1.7$. Such a value is known as the Standard Deviation, or
sometimes as the ‘root mean square deviation’, the latter really
explaining how it is calculated. Thus a full definition would be that
‘the standard deviation is the square root of the average of the
squares of the deviations from the arithmetic average’. This
parameter is usually written as the Greek letter ‘sigma’, $\sigma$,
while the variance, the square of the standard deviation, is written
as $\sigma^2$. The relationship between variance and standard deviation
is fundamental to many later discussions and formulae, so it is
essential to remember it. It is clearly apparent if the two formulae
are written above one another:

$$\text{variance } (\sigma^2) = \frac{\sum (x - \bar{x})^2}{n}$$

$$\text{standard deviation } (\sigma) = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

The symbols used here are the same as have been used earlier, and
careful working through the formulae in terms of the explanations
given above will clarify what they mean.
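
The two formulae can be followed step by step in a short sketch. The function names are chosen here for illustration; the divisor is $n$, as in the text, so the results correspond to `statistics.pvariance` and `statistics.pstdev` rather than to the sample versions that divide by $n - 1$.

```python
import math
from statistics import pstdev, pvariance

def variance(values):
    """Average of the squared deviations from the arithmetic average."""
    average = sum(values) / len(values)
    return sum((v - average) ** 2 for v in values) / len(values)

def standard_deviation(values):
    """Square root of the variance (the 'root mean square deviation')."""
    return math.sqrt(variance(values))

SIMPLE = [6, 5, 4, 3, 2, 1]
print(f"variance {variance(SIMPLE):.3f}")
print(f"standard deviation {standard_deviation(SIMPLE):.3f}")

# The standard library's population statistics use the same definitions.
print(f"library check: {pvariance(SIMPLE):.3f}, {pstdev(SIMPLE):.3f}")
```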


Specific Examples 


The application of these methods to some of the data which were
considered in the previous chapter will also illustrate both the
methods of calculation and some of the properties of the resulting
deviation values. Their value in analysing geographical problems
will then be outlined in succeeding chapters. Thus in Table III the
rainfall data for Bidston Observatory are analysed to obtain both
mean deviation and standard deviation. The methods used are those
presented in the simple example above, and values of 2.79" and
3.45" are obtained for the mean and standard deviations respectively.
The difference in magnitude between these is quite typical, the
standard deviation being about 25% larger than the mean deviation.

Table III


The calculation of mean and standard deviation for annual rainfall
at Bidston Observatory, Birkenhead, 1901-1930

                            Deviation
  Value      Deviation      squared
    x            d            d^2
  25.19        -3.26         10.6
  25.57        -2.88          8.3
  34.42        +5.97         35.6
  25.18        -3.27         10.7
  24.01        -4.44         19.6
  28.08        -0.37          0.1
  26.57        -1.88          3.5
  28.90        +0.45          0.2
  28.45         0.00          0.0
  28.59        +0.14          0.02
  25.27        -3.18         10.1
  30.17        +1.72          2.95
  25.78        -2.67          7.1
  26.02        -2.43          5.9
  26.83        -1.62          2.6
  24.87        -3.58         12.8
  30.59        +2.14          4.6
  31.93        +3.48         12.1
  29.12        +0.67          0.45
  33.34        +4.89         23.9
  22.47        -5.98         35.7
  25.97        -2.48          6.15
  30.92        +2.47          6.1
  32.87        +4.42         19.6
  28.00        -0.45          0.2
  28.95        +0.50          0.25
  34.81        +6.36         40.5
  29.11        +0.66          0.4
  25.15        -3.30         10.9
  36.50        +8.05         65.0
30)853.63    30)83.71     30)355.92

Ave. = 28.45        Mean deviation = 2.79

Variance = 11.86 = $\sigma^2$
Standard deviation = $\sqrt{11.86}$ = 3.45 = $\sigma$

Table IV


Relationship between mean and standard deviations for iron-ore
production in Belgium, France, Luxembourg and the United
Kingdom, 1938-1957

                                     Mean           Standard
Country            Average           deviation      deviation       SD/MD
                   (thous. tons)     (thous. tons)  (thous. tons)

Belgium               34.45             13.15          15.55         1.18
France             9,310.2           4,283.2        4,960.0          1.16
Luxembourg         1,448.8             437.3          527.2          1.20
United Kingdom     4,418.95            480.1          656.5          1.37


This approximate relationship, i.e. standard deviation = 1.25 × mean
deviation, is almost perfectly fulfilled in the case of the Bidston
rainfall, for the factor by which the mean deviation must be
multiplied to give the standard deviation proves to be 1.24. The
relationship for annual rainfall in the British Isles, based on 230
stations, is shown in Fig. 12. Such a close similarity between actual
and theoretical conditions does not always apply, of course, as the
values in Table IV show. These values are for the iron-ore production
figures which were earlier presented in Table II. It can be seen that the
standard deviation is invariably greater than the mean deviation, but
the proportions in these four cases vary between 1.16 and 1.37. Both
these features occur because during the process of squaring the
individual deviations and then taking the square root of the sum of
these squares, the larger deviations carry increased weight, while the
smaller are given somewhat decreased weight. This means that the
standard deviation will always be the greater of the two and that the
degree of difference will be controlled by the relative frequency and
magnitude of large and small individual deviations.

Figure 12. Graph of mean deviation against standard deviation values for
annual rainfall for 230 stations in the British Isles
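
The Bidston figures can be recomputed as a rough check. This sketch works from the unrounded average and unrounded squares, whereas Table III rounds both, so the final digit may differ slightly from the tabulated values. For reference, the ratio for an exactly normal distribution is $\sqrt{\pi/2} \approx 1.25$; the helper name `mean_deviation` is hypothetical.

```python
from statistics import mean, pstdev

# Annual rainfall at Bidston, 1901-1930 (inches), from Table III.
RAINFALL = [
    25.19, 25.57, 34.42, 25.18, 24.01, 28.08, 26.57, 28.90, 28.45, 28.59,
    25.27, 30.17, 25.78, 26.02, 26.83, 24.87, 30.59, 31.93, 29.12, 33.34,
    22.47, 25.97, 30.92, 32.87, 28.00, 28.95, 34.81, 29.11, 25.15, 36.50,
]

def mean_deviation(values):
    """Average absolute deviation from the arithmetic average."""
    average = mean(values)
    return sum(abs(v - average) for v in values) / len(values)

md = mean_deviation(RAINFALL)
sd = pstdev(RAINFALL)          # divisor n, as in the text
ratio = sd / md
print(f"mean deviation {md:.3f}, standard deviation {sd:.3f}, ratio {ratio:.3f}")
```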


Alternative Methods of 
Calculating Standard Deviation 


It is the standard deviation which is the soundest indication of 
scatter in mathematical terms, and it is essential for other computa- 
tions and formulae, as will be seen later. The considerable labour 
involved in its calculation is, however, something of a problem and a 
nuisance. Any short cut in the process of calculation is therefore to 
be welcomed, and there are two possibilities of doing this. The first 
of these is based on an algebraic modification of the formula, so that 
the number of individual calculations is decreased. This not only 
saves time but also reduces the possibilities of error. 

The formula for the variance has been shown to be the following
(p. 22):

$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$$

The major component of this, and the portion that involves the bulk
of the calculations, is $(x - \bar{x})^2$, and it is possible to write
this out in full in the following manner:

$$(x - \bar{x})^2 = (x - \bar{x})(x - \bar{x}) = x^2 - 2x\bar{x} + \bar{x}^2$$

This therefore allows the formula for the variance to be re-written, i.e.

$$\sigma^2 = \frac{\sum (x^2 - 2x\bar{x} + \bar{x}^2)}{n} = \frac{\sum x^2}{n} - 2\bar{x}\,\frac{\sum x}{n} + \frac{\sum \bar{x}^2}{n}$$

Thus each separate part of the expanded version of $\sigma^2$ can be
summed and then divided by the number of occurrences. Although
this is apparently a more complicated and confusing version, it is
nevertheless possible to simplify the individual components. On
p. 6 it was shown that the summation of $x$ over $n$ gives the
average, i.e.

$$\frac{\sum x}{n} = \bar{x}$$

and therefore $\bar{x}$ can be substituted for $\frac{\sum x}{n}$.
Again, as $\bar{x}$ is the average it is bound to be a constant, i.e.
always the same, in the one formula. Therefore if $\bar{x}^2$ is added
up $n$ times and then divided by $n$, the answer must be $\bar{x}^2$,
i.e.

$$\frac{\sum \bar{x}^2}{n} = \bar{x}^2$$

It is now time to attempt the simplification of the formula for the
variance in the following way, by the substitution of $\bar{x}$ for
$\frac{\sum x}{n}$ and of $\bar{x}^2$ for $\frac{\sum \bar{x}^2}{n}$:

$$\sigma^2 = \frac{\sum x^2}{n} - 2\bar{x}\cdot\bar{x} + \bar{x}^2 = \frac{\sum x^2}{n} - 2\bar{x}^2 + \bar{x}^2 = \frac{\sum x^2}{n} - \bar{x}^2$$

Furthermore, as the standard deviation is simply the square root of
the variance, then the standard deviation formula can be written

$$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$

This involves far fewer calculations. Each individual occurrence is
squared, these values are summed and averaged, and then the square
of the average of the occurrences is subtracted. Finally, the square
root of this must be obtained to change it from the variance to the
standard deviation.

A practical example will make a clearer distinction between the 
two methods. Suppose that a study is being made of the sphere of 
influence of a particular town. Amongst the aspects of this that 
might be studied could well be the frequency of train services to 
neighbouring centres of population. From such a study assume that it 
was found that 25 such centres were served, and the number of 
trains per day to each of these centres were as set out in the second 
column of Table V. By simple calculation it could be found that the
average number of trains per day between the town being studied and
any neighbouring centre was 9.6. Apart from the need for careful
use of such a figure because the set of data consists of a discrete 
rather than a continuous variate, it would also be useful to know the 
scatter of the values about this mean, preferably in terms of the 
standard deviation. 

On the right-hand side of Table V the variance is calculated by the
first of the formulae to be presented above, i.e. by
$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$ (METHOD 1), while on the
left-hand side the second of the formulae is used, i.e.
$\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2$ (METHOD 2). As can be seen,
they both give the same variance value, i.e. 26.0, so that the standard
deviation in each case is 5.1. The number of calculations involved is
markedly different, however. In Method 1 there are 25 subtractions,
25 squares, 1 addition, 1 division and 1 square root, a total of 53
operations, each of them a source of possible error and a consumer
of time. In Method 2 the total number of calculations is reduced to
30, i.e. 26 squares, 1 addition, 1 division, 1 subtraction and 1 square
root. Moreover, until the final phases the values involved are all
whole numbers without decimals. This is only true, however, because
the problem is concerned with a discrete rather than a continuous
variate, although this will not always be the case. On the other
hand, the size of the numbers involved in Method 2 can prove
to be very large indeed. The method is most valuable, in fact,
when some form of calculating machine is available, for even
with a small hand-operated desk adding machine it is possible
to calculate both the average and the standard deviation at the
same time. There is no equivalent short cut in the mechanical
handling of Method 1.


Table V

The calculation of the standard deviation by two methods using data
concerning the number of trains per day between one town and
neighbouring towns

(METHOD 2, p. 26)                 (METHOD 1, p. 22)

Occurrences     No. of trains     Difference                        Difference
squared         per day           (with $\bar{x} = 9.6$)            squared
$x^2$           $x$               $(x - \bar{x})$                   $(x - \bar{x})^2$
      1             1                 - 8.6                           73.96
      4             2                 - 7.6                           57.76
      9             3                 - 6.6                           43.56
      9             3                 - 6.6                           43.56
     16             4                 - 5.6                           31.36
     25             5                 - 4.6                           21.16
     36             6                 - 3.6                           12.96
     36             6                 - 3.6                           12.96
     64             8                 - 1.6                            2.56
     64             8                 - 1.6                            2.56
     64             8                 - 1.6                            2.56
    100            10                 + 0.4                            0.16
    100            10                 + 0.4                            0.16
    100            10                 + 0.4                            0.16
    100            10                 + 0.4                            0.16
    121            11                 + 1.4                            1.96
    121            11                 + 1.4                            1.96
    144            12                 + 2.4                            5.76
    144            12                 + 2.4                            5.76
    196            14                 + 4.4                           19.36
    225            15                 + 5.4                           29.16
    225            15                 + 5.4                           29.16
    289            17                 + 7.4                           54.76
    361            19                 + 9.4                           88.36
    400            20                 +10.4                          108.16
25)2,954        25)240                                            25)650.00

$\frac{\sum x^2}{n} = 118.16$     $\bar{x} = 9.6$     $\frac{\sum (x - \bar{x})^2}{n} = 26.0$

$\bar{x}^2 = 92.16$

$$\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2 = 118.16 - 92.16 = 26.0$$

By both methods the variance is shown to be 26.0 and therefore the
standard deviation is the square root of this,

i.e. $\sigma = 5.1$
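
The equivalence of the two methods can be checked with a brief sketch on the train data of Table V. The function names are assumptions made for illustration. One practical caution, not raised in the text: in floating-point arithmetic Method 2 subtracts two large, nearly equal quantities when the average is large relative to the scatter, so Method 1 is usually the safer choice on a computer.

```python
import math

# Trains per day to each of the 25 neighbouring centres (Table V).
TRAINS = [1, 2, 3, 3, 4, 5, 6, 6, 8, 8, 8, 10, 10, 10, 10,
          11, 11, 12, 12, 14, 15, 15, 17, 19, 20]

def variance_method_1(values):
    """Method 1: average of the squared differences from the average."""
    n = len(values)
    average = sum(values) / n
    return sum((v - average) ** 2 for v in values) / n

def variance_method_2(values):
    """Method 2: average of the squares minus the square of the average."""
    n = len(values)
    average = sum(values) / n
    return sum(v * v for v in values) / n - average ** 2

v1 = variance_method_1(TRAINS)
v2 = variance_method_2(TRAINS)
print(f"Method 1 variance {v1:.2f}, Method 2 variance {v2:.2f}")
print(f"standard deviation {math.sqrt(v1):.1f}")
```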


If even the simpler mechanical aids are not available to speed up 
the work, then a long series of data may be analysed by a more 
generalized method which gives an answer approximating very 
closely to the correct one. Moreover, this method also allows for the 
calculation of the average in the same generalized way and at the 
same time. The method involves certain ‘rounding’ or simplifying 
processes which may at first seem arbitrary and unjustified, but the 
proof of the general accuracy of the method can be readily demon- 
strated by a practical example—in fact, by reworking the data which 
have just been analysed by the two exact methods. Mathematical 
proofs of the adequacy of this method are also possible, but will not 
be presented here—the important thing is to become familiar with 
the technique simply as a useful tool. 

The tabulation, formulae and calculations necessary are set out in
Table VI which should be followed carefully in connection with the
explanation which follows. The first task is to group the data into
classes or cells, as is done in the preparation of a histogram (p. 1).
It is essential, however, that for this purpose the range of values in
any one class should be small and that the number of classes should
be at least 10. If these two conditions are not satisfied the margin of
error introduced by the generalization may well be too large to allow
the answers to be of any real use. In the present example, where the
variate is discrete and all the data must be in the form of whole
numbers, it is in many ways adequate simply to list the ‘class marks’
which are shown at the beginning of Table VI. The classes here are 10
in number, each of them consisting of two numbers. If the example
were a continuous variate then the classes would need to cover all
contingencies, and class boundaries would need to be carefully
defined. Even in a case such as the present it is desirable to establish
class boundaries, as this facilitates the correct interpolation of the
class mid-marks. A certain amount of care and thought is required
when these class boundaries are being defined. Thus in Table VI it
could be argued that all values between 0.5 and 1.0 are rounded to 1,
and that all values between 2.0 and 2.5 (but not including 2.5 itself)
are rounded to 2. Therefore the boundaries of this class are from
0.5 to 2.5 and those for this and all the other classes are shown in
the second column in Table VI. The correct definition of the class
mid-mark is now easier, as it is the central value within the class
boundaries. In the present example these mid-marks are thus 1.5,
3.5, 5.5 etc. up to 19.5. These now become the values of the
occurrences with which these calculations are made, and they are
entered in the third column under the heading $x$ to indicate this.
With these entered, it is easy to see the magnitude of the difference
between mid-marks, this being known as the ‘cell interval’ and written
as $c$; here it is 2. Finally in this preparatory tabulation it is
necessary to enter in the fourth column, under the heading $f$, the
number of occurrences falling within the boundaries of each class, i.e.
the frequency of occurrence needs to be obtained, the total frequency
$\sum f$ being entered at the bottom of the column.

It is with the cells, mid-marks and frequencies which are thus
established by simple inspection of the data that this computation of
average and standard deviation values is concerned. The actual
values themselves, and the possible errors of calculation which result
from working with complex numbers, are now temporarily
discarded and these small simple numbers are used instead. To do this
it is necessary first to adopt an assumed mean value, choosing, if
possible, the mid-mark closest to the actual arithmetic mean. This
is largely a matter of experience, but it does not matter in any
fundamental sense if the mid-mark chosen as the assumed mean is in fact
markedly different from the actual mean. This will not invalidate the
answer obtained, nor will it necessitate any change in method of
computation. Its sole effect is that the resultant calculations will
involve larger numbers than would otherwise be necessary. In the
present example the assumed mean (indicated by $x_0$) is entered as
11.5.

Table VI


The calculation of the average and the standard deviation by the
grouped frequency method, using the same train per day data as in
Table V

Class    Class        Class      Frequency   Deviation of     Total        Total of
marks    bounds       mid-mark               cell from x₀     deviation    deviation
                                             in units of c    of class     squared
                      x          f           t                ft           ft²

 1- 2     0.5- 2.5     1.5       2           −5               −10           50
 3- 4     2.5- 4.5     3.5       3           −4               −12           48
 5- 6     4.5- 6.5     5.5       3           −3               − 9           27
 7- 8     6.5- 8.5     7.5       3           −2               − 6           12
 9-10     8.5-10.5     9.5       4           −1               − 4            4
11-12    10.5-12.5    11.5       4            0                 0            0
13-14    12.5-14.5    13.5       1           +1                 1            1
15-16    14.5-16.5    15.5       2           +2                 4            8
17-18    16.5-18.5    17.5       1           +3                 3            9
19-20    18.5-20.5    19.5       2           +4                 8           32

Assumed mean x₀ = 11.5           25                           −25          191
Cell interval c = 2              Σf                           Σft          Σft²

Arithmetic average:

\[ \bar{x} = x_0 + c\left(\frac{\sum ft}{\sum f}\right) = 11.5 + 2\left(\frac{-25}{25}\right) = 11.5 + (-2) = 11.5 - 2 = 9.5 \]

Standard deviation:

\[ \sigma = c\sqrt{\frac{\sum ft^2}{\sum f} - \left(\frac{\sum ft}{\sum f}\right)^2} = 2\sqrt{\frac{191}{25} - \left(\frac{-25}{25}\right)^2} = 2\sqrt{7.64 - 1} = 2\sqrt{6.64} = 2 \times 2.58 = 5.16 \]

It is now time to begin the real calculation, having transferred
the numerical data into a suitable form. Deviation from the mean is
calculated not in absolute terms but rather as the number of ‘units
of cell interval’ away from the assumed mean, i.e. the number of c
units away from the class mid-mark chosen as x₀, and these are
entered in the fifth column under t. The class with the mid-mark equal
to the assumed mean has a value of 0 entered under t, indicating that
the class as a whole is being considered as equal to the mean. Other
values range successively as negative values (−1, −2, −3 etc.) and
positive values (+1, +2, +3 etc.) in the appropriate directions.
This gives the amount by which the cell deviates from the assumed
mean. To obtain the total deviation within any given cell it is
necessary to multiply this deviation value by the number of occurrences
within that cell, i.e. to multiply t by f. This value of ft is entered in
the sixth column in Table VI, the total deviation of the whole series
being entered at the foot of the column as Σft.

The calculation of the average from these retabulated data is now
possible. It will normally be found that the assumed mean differs to
some extent from the actual mean, although the amount of this
difference is not known in advance. The correction for this difference
is simply obtained, however. In column six of the tabulation is given
the total amount by which the actual data differ from the assumed
mean. If this value Σft is divided by the total number of occurrences,
i.e. by Σf, the amount by which the actual average differs from the
assumed average is obtained. This value is given here in units of cell
intervals and must therefore be multiplied by this cell interval value,
i.e. by c, to transfer this difference into actual numbers. By adding
this difference to the assumed mean the actual mean is obtained.
The formula for this calculation is set out in Table VI. In this
example it is found that the assumed mean differs from the actual
mean by one cell interval, i.e. Σft/Σf = −1. As the cell interval is a
value of 2 then the assumed mean must be adjusted by −2 to give the
actual mean. The calculations indicate that this adjustment must be
by subtraction, for the assumed value is higher than the actual one,
so that the actual mean is given as 9.5. This is a very close
approximation to the true value obtained by normal calculations (Table V),
which was 9.6. This value of 9.5 obtained by the grouped frequency
method, as it is called, is exactly the same as one of the class
mid-marks. If this value had been chosen as the assumed mean, which
could quite easily have been the case, the total deviation value under
Σft would have been 0 (a simple calculation will show this to be
true). In that case the adjustment to be applied to the assumed mean,
under the factor Σft/Σf, would also have been 0, thus giving the same
answer as obtained in Table VI. This also stresses the point made
earlier that the closer the assumed mean is to the actual mean, the
easier the calculations that need to be made. As for the difference
between the mean value by this method and that which is correctly
obtained by the normal method of calculation, this results mainly
from the fact that only 10 cells were used, this being the minimum
desirable number. If the number of these had been larger, and the
size of the cells thus smaller, then the difference between the two
methods would have been less. As it is, the difference is no greater
than 0.1, and an accuracy as great as this with as little
involved calculation will prove of inestimable value in the case of
sets of data comprising several hundreds of occurrences.
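
The correction for the assumed mean can be checked numerically. The following is a minimal Python sketch using the mid-marks and frequencies of Table VI; the function name `grouped_mean` and its arguments are hypothetical and serve only as illustration. It also tries 9.5 as the assumed mean, to confirm the remark that Σft would then be 0.

```python
# Table VI: class mid-marks (x) and frequencies (f), trains per day.
mid_marks = [1.5, 3.5, 5.5, 7.5, 9.5, 11.5, 13.5, 15.5, 17.5, 19.5]
freqs = [2, 3, 3, 3, 4, 4, 1, 2, 1, 2]


def grouped_mean(mid_marks, freqs, assumed_mean, cell_interval):
    """Return (mean, sum of f*t) by the grouped frequency method."""
    # t: deviation of each cell from the assumed mean, in units of c.
    t = [round((x - assumed_mean) / cell_interval) for x in mid_marks]
    sum_f = sum(freqs)
    sum_ft = sum(f * ti for f, ti in zip(freqs, t))
    # Mean = x0 + c * (sum(f*t) / sum(f)).
    return assumed_mean + cell_interval * sum_ft / sum_f, sum_ft


mean_a, sum_ft_a = grouped_mean(mid_marks, freqs, 11.5, 2)
mean_b, sum_ft_b = grouped_mean(mid_marks, freqs, 9.5, 2)

print(mean_a, sum_ft_a)  # 9.5 -25
print(mean_b, sum_ft_b)  # 9.5 0
```

Both choices of assumed mean give the same average; only the size of the intermediate numbers changes.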

This account of the grouped frequency method of calculating the
average has been something of a digression, but as in practice the
average is usually calculated from the same tabulation as is the
standard deviation, its inclusion at this point is pertinent. To obtain
the standard deviation by this method requires some further
calculation. As with the ordinary method of calculation, the sum of the
squares of the deviations is needed, i.e. the deviation t is squared and
then multiplied by the frequency in the cell f. This is obtained from
the seventh column in Table VI, where this value is given in terms of
cells, under the head Σft². Again applying the standard formula
this value has to be averaged, i.e. divided by Σf, to give the variance,
from which the standard deviation can be obtained by taking the
square root. In the present method, however, these deviations have
been measured from an assumed mean, so that as in the case of the
average outlined above a correction must be applied for this. This
correction is the same as for the average, i.e. Σft/Σf, save that this
value must be squared to ensure that it is of the same proportions as
the deviations which were themselves squared. Once this is subtracted
from the mean of the sum of the squares, the square root can be
obtained, thus giving the standard deviation in cells. To obtain the
correct answer this value must now be multiplied by the cell interval.
This descriptive account is most clearly appreciated if it is closely
followed in Table VI, bearing in mind that the formula given is
fundamentally the same as Method 1 presented for the standard
deviation on p. 22. The only differences are that a correction factor
is introduced to allow for the assumed mean and the answer has to be
multiplied by the cell interval because all calculations are in terms of
cells. In the example in Table VI the standard deviation is given as
5.16, while the answer by the usual method of calculation is 5.10.
Once again, as with the average, the margin of error is very small,
despite the limited number of cells used. The amount of time saved,
if the number of occurrences is considerable and if they include large
and awkward values, is such that the small error is usually worth
accepting.
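
As a check on the arithmetic, the standard deviation of Table VI can be recomputed in a short Python sketch. The function name `grouped_std` is hypothetical, and the code simply follows the steps described above: average the squared deviations, subtract the squared correction, take the root and multiply by c.

```python
import math

# Table VI: class mid-marks (x) and frequencies (f), trains per day.
mid_marks = [1.5, 3.5, 5.5, 7.5, 9.5, 11.5, 13.5, 15.5, 17.5, 19.5]
freqs = [2, 3, 3, 3, 4, 4, 1, 2, 1, 2]


def grouped_std(mid_marks, freqs, assumed_mean, cell_interval):
    """Standard deviation by the grouped frequency method."""
    t = [round((x - assumed_mean) / cell_interval) for x in mid_marks]
    sum_f = sum(freqs)
    sum_ft = sum(f * ti for f, ti in zip(freqs, t))
    sum_ft2 = sum(f * ti * ti for f, ti in zip(freqs, t))
    # Variance in cell units: mean square minus the squared correction.
    variance_in_cells = sum_ft2 / sum_f - (sum_ft / sum_f) ** 2
    # Convert from cells to actual values by multiplying by c.
    return cell_interval * math.sqrt(variance_in_cells)


sd_from_11_5 = grouped_std(mid_marks, freqs, 11.5, 2)
sd_from_9_5 = grouped_std(mid_marks, freqs, 9.5, 2)
print(round(sd_from_11_5, 2), round(sd_from_9_5, 2))
```

Carried through without intermediate rounding, the result is about 5.15; the figure of 5.16 in Table VI arises because the square root is rounded to 2.58 before being doubled. The choice of assumed mean again makes no difference to the answer.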

As this method may appear rather complicated, although in reality
it proves very simple to work, a clearer understanding may be
obtained if another example is presented, this time with greater
numbers involved. The problem to be analysed can be outlined in the
following way. During a study of farming it is found that poultry
plays a part in the economy of all the farms in the area under review.
The number of poultry kept varies considerably, however, from as
low as 2 to as high as 200, and it is desired to discover the average
number of poultry per farm and also the standard deviation of this
set of values. The data are set out in tabular form in Table VII.
Twenty cells are defined, from 1-10 to 191-200, the cell interval,
i.e. c, being 10. The mid-marks of each cell are also defined, these
being 5.5, 15.5, 25.5 etc. up to 195.5, and they are entered under x.
In the following column is shown, under f, the frequency with
which occurrences fall within the given cells, and it is seen that there
are 1,044 occurrences altogether, i.e. Σf = 1,044. With a number
such as this the full calculation of average and standard deviation
values would obviously be a lengthy process. As the frequency
distribution appears to be a relatively normal one, the assumed
mean is chosen at about the central point so as to keep the size of
numbers to a minimum. It is therefore taken as 105.5. The deviation
of each cell from this value, in terms of the number of cell intervals,
is then entered under t, the value 0 being entered against the cell with
the same value as the assumed mean, while values of −1, −2 etc.
extend to the cells with progressively lower mid-marks and values of
+1, +2 etc. to those with progressively higher mid-marks. The total
amount of deviation in each cell is then given under ft and the total
deviation from the assumed mean in the whole record is given as
Σft. Also, in preparation for the standard deviation calculation these
deviations are squared for each cell and this value again multiplied
by the frequency in that cell, the answer being shown under ft² for
each cell and under Σft² for the full record.

Table VII

The calculation of the average and the standard deviation by the
grouped frequency method, and the application of Charlier’s Test,
using data about the number of poultry per farm

Class      Class       Frequency   Deviation   Total       Total deviation   For test
marks      mid-marks               of cell     deviation   squared
           x           f           t           ft          ft²               f(t+1)²

  1- 10      5.5           5       −10         − 50           500               405
 11- 20     15.5          12       − 9         −108           972               768
 21- 30     25.5          19       − 8         −152         1,216               931
 31- 40     35.5          24       − 7         −168         1,176               864
 41- 50     45.5          33       − 6         −198         1,188               825
 51- 60     55.5          52       − 5         −260         1,300               832
 61- 70     65.5          69       − 4         −276         1,104               621
 71- 80     75.5          75       − 3         −225           675               300
 81- 90     85.5         108       − 2         −216           432               108
 91-100     95.5         120       − 1         −120           120                 0
101-110    105.5         123         0            0             0               123
111-120    115.5         101       + 1          101           101               404
121-130    125.5          85       + 2          170           340               765
131-140    135.5          79       + 3          237           711             1,264
141-150    145.5          60       + 4          240           960             1,500
151-160    155.5          43       + 5          215         1,075             1,548
161-170    165.5          21       + 6          126           756             1,029
171-180    175.5           9       + 7           63           441               576
181-190    185.5           4       + 8           32           256               324
191-200    195.5           2       + 9           18           162               200

Assumed mean x₀ = 105.5  1,044                 −571        13,485            13,387
Cell interval c = 10     Σf                    Σft         Σft²              Σf(t+1)²

Charlier’s Test:

\[ \sum f(t+1)^2 = \sum ft^2 + 2\sum ft + \sum f \]

i.e. 13,387 = 13,485 − 1,142 + 1,044

Arithmetic average:

\[ \bar{x} = x_0 + c\left(\frac{\sum ft}{\sum f}\right) = 105.5 + 10\left(\frac{-571}{1{,}044}\right) = 105.5 - 5.46 = 100.04 \]

Standard deviation:

\[ \sigma = c\sqrt{\frac{\sum ft^2}{\sum f} - \left(\frac{\sum ft}{\sum f}\right)^2} = 10\sqrt{\frac{13{,}485}{1{,}044} - \left(\frac{-571}{1{,}044}\right)^2} = 10\sqrt{12.9 - 0.299} = 10\sqrt{12.6} = 10 \times 3.55 = 35.5 \]

From this point the calculation is both simple and speedy. The
average x̄ is obtained by adding a correction to the assumed mean
x₀. This correction is the average amount by which each occurrence
differs from the assumed mean, this being zero if the true and the
assumed means are the same. As this correction is in terms of cells it
must be multiplied by the cell interval c before being added to the
assumed mean, which is in actual values. For the present example
the calculation of the mean by this method is set out below Table
VII, and an answer of 100.04 is obtained. As for the standard
deviation, the mean of the sum of the squares of the deviations from the
assumed mean, Σft²/Σf, is converted to deviations from the actual
mean by subtracting the above correction factor squared; and the
standard deviation is obtained when the square root is calculated for
this amount and converted from cells into actual values by multiplying
by the cell interval. The answer in this case is seen to be 35.5.
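
The reason the two corrections work can be set out in a few lines. This is a short supplementary derivation rather than part of the original account; it assumes only that every mid-mark can be written as x = x₀ + ct and that t̄ denotes Σft/Σf, a symbol introduced here for convenience.

```latex
% Each mid-mark is the assumed mean plus t cell intervals.
x = x_0 + c\,t, \qquad \bar{t} = \frac{\sum ft}{\sum f}

% Averaging over all occurrences gives the mean.
\bar{x} = \frac{\sum f x}{\sum f}
        = x_0 + c\,\frac{\sum ft}{\sum f}
        = x_0 + c\,\bar{t}

% Deviations from the true mean are c times deviations of t from \bar{t}.
x - \bar{x} = c\,(t - \bar{t})

% Hence the variance, and so the standard deviation.
\sigma^2 = \frac{\sum f (x - \bar{x})^2}{\sum f}
         = c^2 \left[ \frac{\sum ft^2}{\sum f}
                      - \left( \frac{\sum ft}{\sum f} \right)^2 \right],
\qquad
\sigma = c \sqrt{ \frac{\sum ft^2}{\sum f}
                  - \left( \frac{\sum ft}{\sum f} \right)^2 }
```

Since x₀ cancels out of the last line, the standard deviation does not depend on which mid-mark is chosen as the assumed mean.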

When a calculation of this sort is being made, however, there is
always the possibility of arithmetical errors creeping in. It is
therefore desirable to institute a check upon the accuracy of the
calculations in the tabulation. In the present case this is most easily
provided by the application of what is known as Charlier’s Test. At the
cost of some slight increase in calculation this test indicates whether the
main body of the working has been carried out properly. It must be
admitted that it is not an infallible test, as it is possible for some
compensation of errors to occur within the test, but this is extremely
unlikely. As can be seen in Table VII, this test is applied by

\[ \sum f(t+1)^2 = \sum ft^2 + 2\sum ft + \sum f \]

To obtain the value Σf(t + 1)², the simplest method is to add an
extra column to the table. For each cell 1 is added to the t
value; this is squared and then multiplied by the f value. Ideally,
this calculation should be carried out before the average and standard
deviation values are worked out, so that any corrections can then
be made and needless labour avoided. This has been done in Table
VII, where both sides of the equation are seen to be the same (13,387
in this case) thus indicating that the numerical calculations in the
tabulation have been carried through accurately. Any small errors
that remain are simply the result of the generalizing process on which
this method is based.
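
Charlier’s Test lends itself to a short illustration. The Python sketch below recomputes the column totals of Table VII and then applies the test to hand-entered totals; the function name `charlier_check` is hypothetical. Note that when every column is generated by the same program the identity holds automatically, so the test is of real use only for totals reached by hand, which is why a deliberately mis-entered total is included.

```python
# Table VII: frequencies f for the 20 cells 1-10, 11-20, ..., 191-200.
freqs = [5, 12, 19, 24, 33, 52, 69, 75, 108, 120,
         123, 101, 85, 79, 60, 43, 21, 9, 4, 2]
# Deviations t in cell units from the assumed mean 105.5: -10, -9, ..., +9.
devs = list(range(-10, 10))

sum_f = sum(freqs)
sum_ft = sum(f * t for f, t in zip(freqs, devs))
sum_ft2 = sum(f * t * t for f, t in zip(freqs, devs))
check_column = sum(f * (t + 1) ** 2 for f, t in zip(freqs, devs))


def charlier_check(sum_f, sum_ft, sum_ft2, check_column):
    """True if sum f(t+1)^2 equals sum ft^2 + 2*sum ft + sum f."""
    return check_column == sum_ft2 + 2 * sum_ft + sum_f


print(sum_f, sum_ft, sum_ft2, check_column)

# Totals as printed in Table VII pass the test.
table_ok = charlier_check(1044, -571, 13485, 13387)
# A hypothetical slip of 10 in the sum of ft^2 is detected.
slip_ok = charlier_check(1044, -571, 13495, 13387)
print(table_ok, slip_ok)
```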


Variability Indices 


In all these assessments of deviation which have so far been con- 
sidered, the deviation value has been expressed in absolute terms— 
that is to say, in terms of so many inches of rainfall, so many tons of 
iron-ore, so many trains per day per town or so many poultry per 
farm. Within any body of data, however, the magnitude of this 
value is at least in part controlled by the magnitude of the mean 
value. This can be seen in Table IV which has already been considered, 
and also in Fig. 13a, where the mean deviation of annual rainfall for 
some 230 stations in the British Isles is plotted against the mean 
values for those stations. It is when comparisons are being made that 
the influence of the magnitude of the mean value may be rather in- 
convenient. On other occasions, too, it is useful to be able to consider 
the relationship between deviation and the mean value itself. 

In the case of all three of the deviation values which have been 
outlined above (standard, mean and quartile deviations) it is possible 
to indicate this relationship in the same way. This is by simply 
expressing the deviation value as a percentage of the requisite mean, 
thus eliminating the direct influence of the magnitude of the mean 
and facilitating comparison in relative terms between various sets of 
data. This resultant value can be regarded in two ways, both of
which are valid. On the one hand, it represents the percentage
variability of the set of data obtained in the way suggested above. 
On the other hand, if each individual deviation is expressed not in 
absolute terms but as a percentage deviation from the mean, then 
the whole calculation of the standard deviation, for example, could 
be made in these terms. In this way would be obtained the percentage 
standard deviation. 

It is as an index of variability, however, that this percentage value 
is most often used by geographers, especially when distribution 
maps of variability are required. The elimination of the direct 
influence of the mean is its great value in these cases, as can be seen 
in Fig. 13b. Here the rainfall data from Fig. 13a are presented in
terms of percentage rather than absolute values, and the removal of 
any direct relationship with mean value is clearly to be seen. 

If the median and quartile deviation values are being used, an
index of variability can thus be easily obtained by

\[ \frac{\text{quartile deviation}}{\text{median}} \times 100\% \]

In terms of the Bidston rainfall data set out in Fig. 11 and Table I,
this calculation becomes

\[ \frac{2.51}{28.07} \times 100\% = 8.9\% \]

Equally, the resultant values for the iron-ore production listed in
Table II are given in Table VIII, along with the similar values for
other methods to facilitate comparison.


Table VIII

Indices of variability for the iron-ore data previously presented in
Table II

                                         (Relative            (Coefficient of
                                         Variability)         Variation)
Country           Quartile deviation     Mean deviation       Standard deviation
                  / Median               / Average            / Average
                  %                      %                    %

Belgium           41.3                   38.2                 45.1
France            44.85                  46.0                 53.3
Luxembourg        28.1                   30.2                 36.4
United Kingdom     6.65                  10.8                 14.85

When the average is being used, the index of variability depends
on the deviation value. If the simpler mean deviation is employed,
the variability value

\[ \frac{\text{mean deviation}}{\text{average}} \times 100\% \]

is usually referred to as the relative variability. On the other hand, if
it is the standard deviation which is used, so that the calculation is

\[ \frac{\text{standard deviation}}{\text{average}} \times 100\% \]

then this value is known as the coefficient of variation and is written
in formulae as V.
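
The two indices based on the average can be sketched in Python as follows. The function names and the eight sample values are hypothetical, chosen only so that the arithmetic is easy to follow, and the standard deviation is taken with a divisor of n, as in the calculations earlier in this chapter. The quartile-based index is omitted because it depends on the particular convention used for locating the quartiles.

```python
from statistics import mean, pstdev


def relative_variability(values):
    """Mean deviation expressed as a percentage of the average."""
    avg = mean(values)
    mean_deviation = mean([abs(v - avg) for v in values])
    return 100 * mean_deviation / avg


def coefficient_of_variation(values):
    """Standard deviation (divisor n) as a percentage of the average."""
    return 100 * pstdev(values) / mean(values)


# Hypothetical sample: average 5, mean deviation 1.5, standard deviation 2.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

rv = relative_variability(sample)
cv = coefficient_of_variation(sample)
print(f"relative variability {rv:.1f}%, coefficient of variation {cv:.1f}%")
```

For this sample the relative variability is 30% and the coefficient of variation 40%, the latter being the larger, as the text goes on to note is always the case.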

These three methods possess the advantages and disadvantages 
which are implicit in the values on which they are based, and which 
have been outlined earlier (pp. 13-22). In brief, this means that it 
is the coefficient of variation which is mathematically most correct 
and which therefore has the greatest potential value for assessing 
yet further the characteristics of the data under review. It is the rela- 
tive variability which at present is most widely used by geographers, 
however, its easier and quicker calculation being a great advantage 
provided that no further calculations are intended. Moreover, the 
fairly close relationship between mean and standard deviations 
indicated in Fig. 12 means that isopleth maps based on these two 
different methods of indicating variability usually present virtually 
the same pattern—although the quantitative picture is necessarily 
different. Thus if only a relative comparison between areas is 
required the simpler method may be preferable, but if the results are 
to be used to assess, for example, the probability of certain conditions 
obtaining (Chapter 4) then the coefficient of variation is essential. 

The different answers which are obtained by these three indices 
of variability are shown in Table VIII, where the iron-ore data used 
previously are employed. Several relevant points are stressed by this 
table. Firstly, it can be seen that the order of the countries in terms 
of the magnitude of variability is the same whichever method is used 
—France, Belgium, Luxembourg and the United Kingdom. Secondly, 
it is equally clear that the values differ markedly between the methods. 
This means that whenever variability is being presented it is essential 
that the method by which it is assessed be clearly stated. In most
cases the method using the quartile deviation gives the lowest value,
though this is not invariably the case, e.g. Belgium in this table. 
The failure to maintain a regular place in the order is the result of 
the limitations associated with the median and quartile deviation 
indicated on p. 19. With the other two methods, however, the co- 
efficient of variation is always greater than the relative variability 
value, the difference between them being of the order of 25% as in 
the case of the mean and standard deviation (p. 22). The third major 
point from Table VIII is that in comparison with Table IV, where 
the deviation values are shown, the relative position of these four 
countries has changed, e.g. although the standard deviation for the 
U.K. is the second highest, its coefficient of variation is the lowest, 
i.e. the influence of the magnitude of the average has been removed. 

In all these calculations, however, it must be remembered that a 
normal frequency distribution is assumed as is done for the majority 
of statistical methods (p. 8). This does not always occur, and there- 
fore an index of overall variability will not adequately reflect the 
different tendencies and degrees of variability above and below the 
mean. This is again of major importance if further calculations of 
probability are to be made (Chapter 4). It is therefore at times 
desirable to calculate the deviation above the mean separately from 
that below the mean. This can be done for all the three methods but 
an illustration of its effect for the coefficient of variation will make 
the necessary points, using the following set of simple data.

[Table of simple data: columns x, d, d²]

Thus the overall standard deviation is 3.0 and as the average is
6.0 this gives a coefficient of variation of 50%. The distribution is
negatively skew, however, with a ‘tail’ to the left (p. 9), and
therefore the positive and negative deviations from the overall average
have been analysed separately. This yields standard deviation values
of 2.15 in a positive direction and 4.45 in a negative direction, and
therefore coefficients of variation of c. 36% and c. 74% respectively.
This rather more involved calculation thus allows a more accurate
picture to be seen.

Figure 14. The relative variability of annual rainfall over the British Isles,
1901-1930 (from S. Gregory, Quart. J. R. Met. Soc., 81 (1955))

Figure 15. The coefficient of variation of annual rainfall over the British Isles,
1901-1930


In these various ways, therefore, the scatter of actual values about 
the mean can be calculated and expressed. These calculations are of 
varying degrees of complexity and can be made to various degrees
of accuracy, depending on the method used. The chief decision that
any individual has to make, however, is in what way the scatter of 
values should be expressed for any particular set of data. Thus the 
variability of annual rainfall over the British Isles can be expressed 
either by the Relative Variability (Fig. 14) or by the Coefficient of 
Variation (Fig. 15). The issues involved include the purpose for which 
the calculation is being made, whether any further calculations are 
to be based on the results, the nature of the original data, the audience 
to whom the results are to be presented, the presence or absence of 
mechanical or other aids to computation, and the degree of accuracy 
required. Decisions on these and other matters will control which of 
the methods presented in this chapter should be used in any par- 
ticular case. It must be appreciated, however, that both the quartile 
and the mean deviation, as well as their associated indices of vari- 
ability, do not lend themselves to the assessment of further char- 
acteristics of the data, and that they (like the median and the mode 
as mean values) have thus only limited use. In the presentation of 
further methods of analysis in the rest of this book it is therefore 
upon the arithmetic average, the variance, the standard deviation and 
the coefficient of variation that both calculations and theoretical
arguments must necessarily be based. This will become immediately 
apparent in the following chapter where the implications of the mean 
and deviation parameters will be presented and illustrated in terms 
of geographical problems. 




CHAPTER 4 


THE NORMAL FREQUENCY DISTRIBUTION 
CURVE AND ITS USES 


In the previous chapters it has been shown that in order to represent 
a body of data adequately two parameters or characteristics need to 
be defined—the mean and the deviation about that mean. Moreover, 
it has been argued that of the various ways by which this might be 
done, the most effective and soundest method is by the use of the 
arithmetic average and the standard deviation. In these two values 
the body of data is briefly but satisfactorily summarized. 


Characteristics of the Normal Curve 


However, if it were stated that the average yield of wheat per acre 
for a series of farms was 30 bushels and that the standard deviation 
was 5 bushels, what would this imply? What extra information could
be interpreted from such a statement? The point that must be borne 
in mind is that the standard deviation presents a summary of the dis- 
tribution curve of the data concerned, while the mean indicates the 
actual value about which this curve is distributed. On the assumption 
that the frequency distribution is a normal one (which is the usual 
assumption in statistical methods unless otherwise specified), this 
curve which the standard deviation represents is symmetrically placed 
about the central point which the average indicates. Thus if in several 
records the means differ but the standard deviations are the same, 
then the shape of the distribution curve will be the same in all cases 
but related to a different point on the magnitude scale—this is por- 
trayed diagrammatically in Fig. 16. Conversely, if the average is kept 
constant while standard deviation values differ, different curves are 
indicated around the same central value. Again, these differences are 
seen in Fig. 16. 

Within the area enclosed by each of these curves and the base line
are recorded all the occurrences which contribute to the mean value,
these being accounted for in terms of both their order of magnitude
and the number of occurrences at each such order of magnitude. If,
as stated above, the standard deviation summarizes the shape of the
distribution curve then it equally summarizes the number of
occurrences at each order of magnitude. The point is that given a normal
distribution curve it is possible to postulate the number of occurrences
at any given value and between given values. The mathematics of this
are best left on one side. Instead, from a consideration of Fig. 17 it
is possible to comprehend the more significant characteristics of the
normal curve in these terms.

Figure 16. The graphical representation of changes in average and standard
deviation values

Figure 17. Percentage values of the normal distribution curve

In Fig. 17 is presented a normal curve symmetrically distributed
about a mean value x̄, the shape of this curve being expressed by the
standard deviation of the set of data, i.e. σ. The values used are those
mentioned at the beginning of this chapter, namely an average of
30 bushels per acre and a standard deviation of 5.0 bushels. The area
between the curve and the base line is here divided by vertical lines
which are drawn at a distance away from the average (both above
and below it) equal to the standard deviation and to successive
multiples of that deviation, e.g. at x̄ plus 1σ which is 30 + 5 = 35;
then at x̄ plus 2σ which is 30 + 10 = 40 etc. By applying a rather
complicated formula it is now possible to say what percentage of the
whole set of data will lie between any successive pair of these vertical
lines. The values which apply when the distribution curve is truly
normal are entered on the diagram in Fig. 17 in a somewhat
generalized form.

It can be seen that some two-thirds of the occurrences lie less than
1 standard deviation away from the average, i.e. between the values
x̄ − 1σ and x̄ + 1σ. Equally about 95% of the occurrences lie less
than 2 standard deviations away from the average, while less than
1% of them differ from the average by more than 3 standard
deviations. To be rather more precise, these values imply the following,
provided that the curve is perfectly normal:


68.3% of the occurrences will lie between +1σ and −1σ, i.e. there
is roughly a 2 : 1 chance that a value will lie between those
limits and a 1 : 2 chance that it will not.

95.45% of the occurrences will lie between +2σ and −2σ, i.e. there
is roughly a 21 : 1 chance that a value will lie between those
limits and a 1 : 21 chance that it will not.

99.7% of the occurrences will lie between +3σ and −3σ, i.e. there
is roughly a 330 : 1 chance that a value will lie between those
limits etc.

99.99% of the occurrences will lie between +4σ and −4σ, i.e. there
is only one chance in 10,000 that a value will differ from the
average by more than this amount.
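
The percentages listed above follow from the normal curve itself and can be reproduced with the error function in Python's standard library. The sketch below is illustrative only; the function name `percent_within` is hypothetical.

```python
import math


def percent_within(k):
    """Percentage of a normal distribution within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3, 4):
    inside = percent_within(k)
    # Odds that a value lies inside the limits rather than outside them.
    odds = inside / (100 - inside)
    print(f"within {k} sigma: {inside:.2f}%  (odds about {odds:.0f} : 1)")
```

Without rounding, the odds at 3σ come out nearer 370 : 1 and the chance of exceeding 4σ is rather less than one in 10,000; the rounder figures quoted above follow from taking the percentages as 99.7% and 99.99%.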


The percentage values quoted in this way, or those shown in
Fig. 17, are very useful as indicators of the scatter of actual values
about the average, but they only provide figures for whole numbers
of standard deviations. A more complete picture is obtained if more
values are available. These have been calculated and are presented
in print elsewhere. In Table IX a selection of these values is set out.
This table gives the percentage of the occurrences which will lie
within a certain number of standard deviations from the mean and
also the number of standard deviations from the mean that will
enclose certain percentages of the occurrences. Thus within +2.5σ and
−2.5σ lie 98.76% of the occurrences, while 50% of the occurrences
will differ from the mean by not more than 0.6745 standard deviations.

Table IX

Percentage points of the normal distribution

    %        σ             %        σ
   10      0.1257         90      1.6449
   20      0.2533         92      1.7507
   30      0.3853         94      1.8808
   38.30   0.5000         95.45   2.0000
   40      0.5244         96      2.0537
   50      0.6745         98      2.3263
   60      0.8416         98.76   2.5000
   68.26   1.0000         99      2.5758
   70      1.0364         99.73   3.0000
   80      1.2816         99.95   3.5000
   86.64   1.5000         99.99   4.0000

% = the percentage of the occurrences that will lie not more
    than the given number of σs away from the mean.
σ = the number of standard deviations away from the mean
    within which limits the given percentage of the
    occurrences will lie.

For full details see: D. V. Lindley and J. C. P. Miller, Cambridge Elementary
Statistical Tables, Cambridge, 1953 (Table II).

Armed with this information it is now possible to look once more
at the crop-yield example quoted at the beginning of this chapter.
By reference to the percentage points of the normal distribution given
in Table IX, it can readily be calculated that, for example, 80% of the
occurrences lie between 23.59 and 36.41 bushels per acre (i.e. the
mean ± 1.2816σ), or that although the average value is 30 bushels
per acre, individual values lie outside the range 27.5 to 32.5 bushels
per acre on 61.7% of occasions. The ability to assess such aspects
with little further effort once the average and standard deviation are
calculated represents one of the greatest advantages of working in
terms of those units rather than any of the others which were
considered in the previous two chapters.
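
The two crop-yield statements can be reproduced without tables by means of `statistics.NormalDist` from the Python standard library (version 3.8 or later). This is a minimal sketch which assumes, as the text does, that the yields are normally distributed with an average of 30 and a standard deviation of 5 bushels per acre.

```python
from statistics import NormalDist

yields = NormalDist(mu=30, sigma=5)

# Central 80% of occurrences: cut off 10% in each tail.
low, high = yields.inv_cdf(0.10), yields.inv_cdf(0.90)

# Share of values outside 27.5 to 32.5, i.e. more than 0.5 sigma from the mean.
outside = 1 - (yields.cdf(32.5) - yields.cdf(27.5))

print(f"80% of yields lie between {low:.2f} and {high:.2f} bushels per acre")
print(f"{100 * outside:.1f}% of yields lie outside 27.5 to 32.5 bushels per acre")
```

The results, 23.59 to 36.41 bushels and 61.7%, agree with the figures obtained from Table IX.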

In doing this, however, it is necessary to be sure that the
distribution is reasonably normal. Given that this is so, the percentages
quoted above will be found to hold true. This can be clearly
illustrated from the Bidston rainfall data analysed previously. It was
shown in Table III that the average for this set of data was 28.45"
while the standard deviation was 3.45". From the percentage points
of the normal distribution (Table IX) it can be seen that between +1σ
and −1σ, i.e. between 31.90" and 25.00", should lie 68.3% of the
occurrences or a total of 20.5 occurrences. In reality, as reference to
Table I where the values are arranged in order of magnitude will
show, 21 of the 30 occurrences, or 70%, fell between these limits. With
only 30 occurrences this is very nearly as close to the calculated
value of 68.3% as could possibly occur.
Equally, between 35.35" and 21.55" (i.e. between +2σ and −2σ) lie 29
of the occurrences, which can be compared with the 28.6 occurrences
(95.45%) which theory forecasts. In this particular set of data no
value differs from the average by as much as 3 standard deviations.
As such a difference should theoretically happen only 3 times in 1,000
and there are only 30 occurrences here being studied, this is to be
expected. Thus, as a concise statement of annual rainfall conditions
at Bidston an average of 28.45" and a standard deviation of 3.45"
tells a very great deal, and the addition of the standard deviation to
the more usual information of the average vastly expands the
information provided.
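
The theoretical counts quoted for Bidston can be reproduced from the mean,
the standard deviation and the number of years alone. The sketch below is
illustrative only; the function name expected_within is an assumption, and
the individual yearly values of Table I are not used.

```python
from statistics import NormalDist

def expected_within(n, k):
    """Expected number of n occurrences within k standard deviations of the mean."""
    z = NormalDist()
    return n * (z.cdf(k) - z.cdf(-k))

mean, sd, n = 28.45, 3.45, 30  # Bidston annual rainfall (inches), 30 years

for k in (1, 2, 3):
    lower, upper = mean - k * sd, mean + k * sd
    print(f"within {k} sd ({lower:.2f} to {upper:.2f}): "
          f"{expected_within(n, k):.1f} of {n} expected")
```

The counts actually found in Table I (21 and 29) can then be set beside
the expected figures for one and two standard deviations.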


Three Standard Deviations Check 


Furthermore, this example also illustrates another use of the
standard deviation. It has been seen in the foregoing example that no
value differs from the average by more than 3 standard deviations,
largely because the number of occurrences is relatively few. Even
with a large body of data a difference this great is only to be expected
once in more than 300 occurrences. When assessing scatter by the
standard deviation it is therefore useful to check the accuracy of the
calculations and the data by seeing whether any record does differ
from the average by more than 3σ. For example, if such a large
difference from the average is found in a record of only 50 values, it is
advisable to regard such a value with some suspicion. It may well be
truly valid, for the exceptional case has to occur some time and it
may be within the 50 occurrences being studied. On the other hand
it is as well to check for errors. Perhaps a figure has been wrongly
written or read; a small change in the character of the data may have
been missed; a rain-gauge may have developed a leak! In other
words, the set of data may not be strictly homogeneous nor the
distribution curve approximately normal. This ‘3 standard deviation’
check is therefore a safeguard against really gross errors. Thus it
could well have been applied to the answer obtained by the grouped
frequency calculation of the standard deviation of the ‘number of
poultry per farm’ data set out in Table VII. Three times the standard
deviation of 35.5 equals 106.5. As the average value was 100.04 and
the range of values from 2 to 200 (p. 34), no value differs from the
average by more than 3σ, despite the 1,044 occurrences. Clearly no
major discrepancy would seem to be within the record, although this
is no safeguard against minor ones.
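
The check described above is easily mechanized. The following sketch is
an illustration only: the rainfall record is hypothetical (the entry 82.9
imitates a wrongly written figure), the function name is an assumption,
and the standard deviation is taken with the divisor n, which is also an
assumption about the method of calculation.

```python
from statistics import mean, pstdev

def three_sigma_suspects(values):
    """Return the values lying more than 3 standard deviations from the mean."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (divisor n, an assumption)
    return [v for v in values if abs(v - m) > 3 * s]

# Hypothetical annual rainfall record (inches); 82.9 imitates a misread entry.
record = [28.1, 31.4, 25.9, 29.0, 27.3, 30.2, 26.8, 82.9, 28.7, 29.5,
          27.9, 30.8, 26.1, 28.4, 29.9, 27.0, 31.0, 28.8, 26.5, 30.1]
print(three_sigma_suspects(record))

# Poultry data: the summary figures quoted in the text suffice for the check.
m, s = 100.04, 35.5
print(max(abs(2 - m), abs(200 - m)) > 3 * s)  # is either extreme beyond 3 sd?
```

A value returned by such a check is not thereby shown to be wrong; it is
simply one that deserves a second look.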


Probability Theory 


In these paragraphs on the normal frequency distribution mention 
has several times been made of the percentage of occurrences within 
certain limits, or of the chance of certain values occurring. This intro- 
duces a theme which is fundamental to the whole of statistical 
analysis, namely the theme of probability. In the case of a large pro- 
portion of analyses one of the main purposes is to assess the prob- 
ability that given values are likely to occur, or to be exceeded, or not 
to be reached. From other points of view the problem may also be 
posed in a rather different way although it is still basically the same 
problem. Thus the question may be asked as to the probability 
that certain events are likely to occur at given intervals, or that a 
certain distribution pattern has some significant meaning. Moreover, 
even to interpret the average properly it is necessary to think in 
terms of probability—the probability of its being exceeded or not, for 
example. 

This field of probability theory is vast and complex in detail, 
although its fundamentals are simple enough. If the full set of any 
body of data is considered, the probability that any individual occur- 
rence will lie between the values for the outer limits of that complete 
set is obviously 100%. Equally, the probability of any value being 
equal to or lower than the highest value of the set is also 100%, as 
is the probability of any value being equal to or greater than the 
lowest value of the set. In other words, if the full set of data is con- 
sidered it must necessarily contain all the events and therefore all the 
probabilities. The fact that the full set of data represents 100%
probability is shown by expressing this total probability as unity, i.e.
as 1.0.

If, on the other hand, the probabilities of values being greater than 
average and less than average were to be considered then, assuming 
a perfectly normal frequency distribution, each of these events would 
occur with a 50% probability, i.e. there would be an equal likelihood 
of a value being above or below the average, and a complete certainty 
that it must be one or the other. This can be tabulated as follows: 


probability of a value being greater than average = 50% or 0.5
probability of a value being less than average    = 50% or 0.5

total probability of the value being greater or less
than the average                                  = 100% or 1.0


Again, by reference to Table IX it can be seen that the following 
probabilities hold true, if the distribution is normal: 


probability of a value differing from the
average by less than 2σ                          = 95.45% or 0.9545

probability of a value differing from the
average by more than 2σ                          = 4.55% or 0.0455

total probability of a value differing from the
average by more or less than 2σ                  = 100% or 1.0


These two simple examples make it clear that the sum of the indi- 
vidual probabilities within a set of data is the same as the total 
probability which is unity. 

The problem of assessing the probability with which given values 
or events are likely to occur is thus basically a problem of deciding 
how to allocate the total probability between the various possibilities 
under review. In the above examples only two possibilities were pre- 
sent in each case, but far more complex conditions can be argued in 
the same way. It must be realized and accepted from the outset, how- 
ever, that a statement of probabilities in this way does not indicate 
when the specified conditions will occur. It does no more than assess 
the frequency with which those conditions are likely to occur over 
an infinitely long set of records. The longer that set may be, the closer 
actual frequencies or probabilities are likely to be to these theoretical 
values. This theme will be expanded at greater length when the taking
and analysing of samples, and the question of their size, are
considered in Chapter 6.

The problem mentioned above of allocating the total probability 
between various possibilities must be decided in terms of the type of 
frequency distribution curve which most closely fits, or approximates 
to, the curve of the data itself. For many sets of data it is the normal 
curve, already partially considered above, which is the relevant one. 
For other sets of data, however, or for other purposes, rather
different distribution curves may apply. Of these the most common and
most useful are the Binomial Distribution and the Poisson Distribu- 
tion. These two will therefore be considered and illustrated in 
Chapter 5, after the normal curve and its implications have been 
further examined. These various possibilities of assessing probability 
will be simply presented with a minimum of mathematical theory 
and a maximum of practical value. 


Probability and the 
Normal Frequency Distribution 


From the consideration of the percentage points of the normal
distribution earlier in this chapter several indications were given
concerning the probability with which specified conditions occur. Thus
it was seen that the probability of a value differing from the average
by more than 2σ was 4.55%, or, in terms of annual rainfall at Bidston,
that values outside the range 21.55" to 35.35" will occur with this
same probability or frequency. Very often, however, it is not the
probability of values falling within a certain range which is relevant
and of interest but rather the probability that values will exceed or
fall below some given value. For example, it may be of value to know
the probability that an occurrence will exceed the average by more
than 2σ, or that rainfall at Bidston will be greater than 35.35" (which
is itself 2σ greater than the average). Clearly, if the distribution is a
normal one the 4.55% probability that a value will differ from the
average by more than 2σ will be equally distributed between the two
ends of the curve, i.e. between values greater than x̄ + 2σ and values
less than x̄ − 2σ. This has, in fact, already been shown
diagrammatically in Fig. 17. Therefore, having obtained from Table IX the
fact that 95.45% of the values lie between −2σ and +2σ, and that
therefore 4.55% fall outside these limits, it is simply a matter of
halving this latter value to find the percentage of values that are likely
to be greater than x̄ + 2σ. So it can be established that 2.275% of the
values should fall into this category, or in terms of annual rainfall at
Bidston that in 2.275% of the years rainfall should be greater than
35.35". This chance of roughly one in 44 does, in fact, occur once in
the thirty-year record set out in Table I, but the similar probability
of a fall of less than 21.55" does not occur within that particular
short period.

This reasoning has been presented at some length because it is 
basic to the calculation of probabilities of any and every value. 
Table IX referred to above, from which was obtained the percentage 
probability of values being between +2σ and −2σ, is not in the most
convenient form for other calculations. Therefore, if intermediate 
values of standard deviations are required, or if the problem is posed 
in terms of the probability of a given value being exceeded, a simple 
calculation and reference to tables of the normal distribution function 
can be made. The calculation is as follows: 


$$\text{required figure} = \frac{\text{critical value} - \text{mean value}}{\text{standard deviation}}$$

which is usually written

$$d = \frac{x - \bar{x}}{\sigma}$$


The required figure or d is the figure which is needed for reference to 
tables. Into the right-hand side of the formula can be entered the 
mean and standard deviation values, and also the value which is 
being investigated. The calculation gives an answer which indicates 
the extent to which the critical value differs from the mean expressed 
in terms of ‘so many’ standard deviations. Thus, to recalculate the 
earlier Bidston example by this method, the following would be done. 
Suppose that it is desired to know the percentage probability that 
values will exceed 35.35" of rainfall, this being then the critical value
in the formula. Values can be entered in this way:

$$d = \frac{x - \bar{x}}{\sigma} = \frac{35.35 - 28.45}{3.45} = \frac{+6.90}{3.45} = +2.0$$

From this required figure of d = 2.0 the appropriate percentage
probability is then obtained from Table X, the Normal Distribution
Function. The value in this case is 2.275%, and since d is positive this
indicates the percentage probability that occurrences will be more
than the critical value. This is the same probability as that obtained
by the alternative method on p. 53. The probability of occurrences
being less than this value is obtained by ‘100 − tabled percentage’,
i.e. 100 − 2.275 which equals 97.725%. Conversely, if it were desired


Table X

The Normal Distribution Function

    d       %        d       %        d       %        d       %        d      %
  0.00    50.00    0.50    30.85    1.00    15.87    1.50     6.68    2.0    2.275
  0.10    46.02    0.60    27.43    1.10    13.57    1.60     5.48    2.5    0.621
  0.20    42.07    0.70    24.20    1.20    11.51    1.70     4.46    3.0    0.135
  0.30    38.21    0.80    21.19    1.30     9.68    1.80     3.59    3.5    0.023
  0.40    34.46    0.90    18.41    1.40     8.08    1.90     2.87    4.0    0.003

If ‘d’ is positive
d = the number of standard deviations that the critical value is
    above the mean.
% = the percentage probability that the occurrence will be more than
    the corresponding value of ‘d’; the probability that it will be
    less than this value is (100 − %).

If ‘d’ is negative
d = the number of standard deviations that the critical value is
    below the mean.
% = the percentage probability that the occurrence will be less than
    the corresponding value of ‘d’; the probability that it will be
    more than this value is (100 − %).

For a more detailed table see D. V. Lindley and J. C. P. Miller, Cambridge
Elementary Statistical Tables, Cambridge, 1953 (Table I).


to know the probability of occurrences below 21.55" a similar
calculation would be made:

$$d = \frac{x - \bar{x}}{\sigma} = \frac{21.55 - 28.45}{3.45} = \frac{-6.90}{3.45} = -2.0$$

Again Table X may be used, but because the d value in the above 
calculation is negative the percentage values have to be interpreted 
in the reverse way, as is indicated in the footnote to the table itself. 
Interpreted in this way the table gives the percentage probability of 
values being less than the critical value, and the adjustment by means
of ‘100 — tabled percentage’ is used to obtain the probability of 
occurrences above the critical value. So in the present example the 
percentage probability of annual rainfall at Bidston being below 
21.55" is again 2.275%. This dual use of the table is necessitated by
the fact that it gives values along only one side of the distribution
curve, i.e. between the average and any one end of the curve. This is
because it is concerned with the probability of values being exceeded
or not, rather than with the probability of values falling within certain
limits, as was Table IX. These two tables (IX and X) are, of course,
simply different ways of expressing the same set of relationships, both
being based on the form of the normal frequency curve described
earlier.

Figure 18. Rainfall probability map of East Africa (after J. Glover, P. Robinson,
J. P. Henderson, Quart. J. R. Met. Soc., 80 (1954))

Figure 19. Rainfall probability map of the British Isles
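
The two Bidston calculations, and the dual reading of Table X according
to the sign of d, may be sketched in Python as follows. The function
names d_value and tabled_percent are illustrative assumptions, and the
normal curve is evaluated directly rather than read from the printed
table.

```python
from statistics import NormalDist

def d_value(critical, mean, sd):
    """Number of standard deviations by which the critical value differs from the mean."""
    return (critical - mean) / sd

def tabled_percent(d):
    """Percentage in the tail beyond d, as tabulated in Table X (size of d only)."""
    return 100 * (1 - NormalDist().cdf(abs(d)))

mean, sd = 28.45, 3.45  # Bidston annual rainfall (inches)

for critical in (35.35, 21.55):
    d = d_value(critical, mean, sd)
    pct = tabled_percent(d)
    side = "more" if d > 0 else "less"  # positive d: exceeded; negative d: not reached
    print(f"d = {d:+.1f}: {pct:.3f}% chance of {side} than {critical} inches, "
          f"{100 - pct:.3f}% otherwise")
```

Because the tabled percentage depends only on the size of d, a single
function serves both ends of the curve, exactly as a single table does.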

Such studies of rainfall can provide valuable information in
relation to water-supply problems, irrigation requirements, river run-off
and flood conditions. Maps showing rainfall probability
characteristics have been prepared for various countries, and Figs. 18 and 19
provide two examples. The implications and possibilities of this
method will be more fully appreciated, however, if another set of
data is analysed and various probability features assessed. For this
purpose it is convenient to use the data concerning the number of
poultry per farm presented in Table VII and examined in Chapter 3.
Use of these data has the following advantages: the mean and
standard deviation values are already calculated, the distribution curve
has been seen to be very close to normal, and, as the data are already
tabulated, it is possible to check the answers and so assess their
relative accuracy and value. The arithmetic average of this set of data
was 100.04 and the standard deviation 35.5. From these parameters
it is desired to assess the probability of occurrence of certain
conditions. The fact that it is a large body of data, consisting of 1,044
items, means that the resulting values should approximate closely to
the values within the body of data itself.

The first enquiry is to discover the percentage probability that 
more than 140 head of poultry will occur on a farm—or, to put it 
another way, the percentage of the farms which are likely to have 
more than 140 head of poultry. This can be readily calculated as in 
the previous example, thus: 


$$d = \frac{x - \bar{x}}{\sigma} = \frac{\text{critical value} - \text{mean value}}{\text{standard deviation}} = \frac{140 - 100.04}{35.5} = \frac{+39.96}{35.5} = +1.125$$

In this case the d value is positive, so that the percentage probability 
of exceeding the critical value can be read directly from Table X. This 
indicates that more than 140 poultry can be expected to be found on 
13.03% of the farms. By summing the numbers of occurrences in the
cells in Table VII it is found that 139 farms out of 1,044 fall into this
category, i.e. 13.3%, which differs but little from the assessed value.

Again, it may be desired to find out how many farms have rela- 
tively few poultry, taking 20 as the critical value. The calculations 
follow the same line as above: 

$$d = \frac{x - \bar{x}}{\sigma} = \frac{20 - 100.04}{35.5} = \frac{-80.04}{35.5} = -2.25$$

Here the d value is negative, but as it is the proportion of occurrences
below this value that is required the necessary value can again be read
directly from Table X. With the abbreviated version given here it can
only be placed between 0.621% and 2.275% probability, but from the
full tables the correct value is seen to be 1.22%. There is some
discrepancy between this and the numbers in this category in Table VII,
for 17 farms out of 1,044, i.e. 1.63%, had this small number of poultry.
Such differences, however, do not reflect faults in the method (or in
the calculations!). They result partly from the fact that the 1,044 values
used do not form a completely perfect normal distribution, although
they approximate sufficiently closely to one to make the method valid.
Also they occur because the percentage probability values refer to an
infinitely long series of data, as mentioned earlier (p. 51), while this
series is finite. It is for this reason that it is incorrect simply to count
the number of occurrences beyond the critical values specified. Such
a count would give an answer which would only refer to the 1,044
occurrences actually available. On the assumption that these 1,044
occurrences are a true reflection of all the possible occurrences,
including those for which data are not available, the probability values
obtained by the calculations outlined above would apply to the full
record and not merely to these 1,044 occurrences. The extent to which
this assumption of being representative of the full record is justified,
and the ways in which allowance can be made in case it is not, will
be examined in more detail in Chapter 6.
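
The comparison between the assessed probabilities and the counts in
Table VII can be set out side by side. This is a sketch only: it
evaluates the normal curve directly with an unrounded d, so the
theoretical percentages may differ in the second decimal place from
those read from printed tables, and the layout of the list named cases
is an illustrative assumption.

```python
from statistics import NormalDist

farms = NormalDist(mu=100.04, sigma=35.5)  # mean and standard deviation, Table VII
total = 1044

# (description, theoretical probability, farms counted in Table VII)
cases = [
    ("more than 140 poultry", 1 - farms.cdf(140), 139),
    ("fewer than 20 poultry", farms.cdf(20), 17),
]

for label, prob, counted in cases:
    print(f"{label}: theory {100 * prob:.2f}%, "
          f"counted {100 * counted / total:.2f}% ({counted} of {total})")
```

The small differences between the two columns are of the kind discussed
above: the counted column describes only the 1,044 farms available.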

Before leaving this theme of probability based on the normal curve 
there is one other aspect to be presented briefly. Apart from discover- 
ing the probability with which given values can be expected to be 
exceeded it is also often desirable to assess the value that can be ex- 
pected to occur or be exceeded with a given probability. For example, 
it could well be of interest to define the number of poultry which is 
equalled or exceeded on 80% of the farms. Probability values are 
tabulated in terms of half the distribution curve, however, so that it 
is convenient to pose the problem in terms of one half of the curve 
or the other, rather than in terms of something that overlaps the mean 
value. Thus, this problem could be put as one of defining that value 
below which will fall 20% of the occurrences. The value must there- 
fore of necessity be below the mean and the d value will be a negative 
one. What will it be? In this case it is possible to know this value in 
advance from the normal curve, for it is the value of x (the critical
value) which is to be discovered. To obtain the d value it is necessary
to consult Table X again to find the value which will ensure that 20%
of the occurrences fall below it. This is seen to lie between 0.80 and
0.90, and detailed tables give it as 0.8416. In other words, 20% of the
occurrences lie more than 0.8416σ below the average, while 80% of
the occurrences lie above this value, i.e. this is the value that the
problem is concerned with. Now it is possible to insert all but one of
the values into the probability formula. Thus:


$$\bar{x} + d\sigma = x$$
$$\text{i.e. } 100.04 + (-0.8416 \times 35.5) = x$$
$$100.04 - 29.88 = x = 70.16$$


Thus 80% of the farms in an infinite series will possess more than 
70.16 poultry (i.e. c. 70 or more), while 20% of the farms will possess
less than this amount, these computed values being very closely borne 
out from the values in Table VII. This formula for assessing such 
values is simply an adjustment of the one presented earlier, and can 
be put in a standard form as follows: 


$$\text{critical value} = d(\text{standard deviation}) + \text{the mean}$$

or

$$x = d\sigma + \bar{x}$$

always bearing in mind that d may be a negative value as in the above
example. The relationships implied by these two forms of the formula
are presented diagrammatically in Fig. 20.


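[Figure 20: normal frequency curve, with Frequency on the vertical axis
and Magnitude on the horizontal axis. The critical value x lies at
d = −0.8416 below the mean x̄; 20% of the occurrences fall below x, 30%
between x and the mean, and 50% above the mean.]

Figure 20. Diagram of calculating probability values from the normal
distribution curve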
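
The reverse calculation, from a stated probability to the critical value,
can be sketched as follows. The function name critical_value is an
illustrative assumption, and the d value is obtained from the inverse of
the normal distribution function rather than by searching Table X.

```python
from statistics import NormalDist

def critical_value(mean, sd, percent_below):
    """Value below which the given percentage of a normal distribution falls."""
    d = NormalDist().inv_cdf(percent_below / 100)  # negative when below the mean
    return d, mean + d * sd  # x = d(standard deviation) + the mean

d, x = critical_value(100.04, 35.5, 20)
print(f"d = {d:.4f}, critical value = {x:.2f} poultry")
```

Asking for the value below which 20% of the farms fall is the same
question as asking for the value equalled or exceeded on 80% of them.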
In all these considerations the probability values obtained have
specifically omitted any suggestion as to when the stated conditions 
might be expected to occur, while it has been stressed (pp. 51 and 
58) that such values apply more strictly to a large body of data rather 
than to a limited one. Thus there is no implication that, because a
given value is exceeded with an 80% probability, in any 10
occurrences 8 of them will be above the given value, although in 10,000
occurrences it is likely that 8,000 will exceed it. Even less has any
suggestion been made as to which of any 10 occurrences are likely 
to exceed that value, and which drop below it. This falls in the realms 
of forecasting, not of statistical analysis. What can be attempted by 
means of statistical analysis, however, is to indicate the probability 
that in any ten occurrences 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 of them 
will fall into the category of exceeding the given value. To be able 
to do this has obvious practical implications in terms of the reliance 
that can be placed on conclusions drawn from certain amounts of 
data, or the number of occurrences that need to be considered before 
an adequate degree of reliability can be obtained—a theme to be 
taken up more fully in Chapters 6 and 7. Also, in terms of
interpreting data the probabilities of certain conditions occurring with a
particular frequency may be far more valuable than a simple use of the mean
or even of the overall probabilities already considered in Chapter 4.
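
The contrast between short and long runs of occurrences can be seen in a
small simulation. This is an illustration only and forms no part of the
method itself: the occurrences are artificial, drawn so that each has an
80% chance of exceeding the given value, and the seed and function name
are arbitrary assumptions.

```python
import random

def count_exceeding(n, p, rng):
    """Count how many of n independent occurrences exceed a value exceeded with probability p."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(1)  # fixed seed so that the run is repeatable
p = 0.8

short_runs = [count_exceeding(10, p, rng) for _ in range(5)]
long_run = count_exceeding(10_000, p, rng)

print("five sets of 10 occurrences:", short_runs)
print("one set of 10,000 occurrences:", long_run)
```

The sets of ten need not each contain exactly 8 exceedances, whereas the
long run should lie close to 8,000.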


Probability and the 
Binomial Frequency Distribution 


To obtain such probability values involves the consideration of 
another distribution curve, namely the Binomial Distribution. This is 
concerned with the relative frequency of occurrence of two numbers, 
or rather sets of conditions, which are mutually exclusive and which 
together represent the sum total of probability. Thus once a given 
set of conditions or a value is accepted as being critical and therefore 
worth analysing, then all the occurrences in the body of data can be 
classified as either belonging to that set of conditions or as not so 
belonging. This gives the overall long-term probability of these con- 
ditions occurring, either by counting in a finite body of data or by 
assessment from the normal distribution curve for an infinite body
of data. Given that some specified number of occurrences are to be
considered, it is possible either for all these occurrences to belong to 
that set of conditions or for none of them to belong to that set of 
conditions or for some to belong and some not to belong, the pro- 
portions of each being liable to as many differences as there are 
occurrences under study. The prime characteristic of the binomial 
distribution is that it reflects the frequency (or the probability) with 
which these different possibilities are likely to occur, for any given 
percentage probability of the specified conditions and any given 
number of occurrences being considered. 

A simple illustration may help to make the general principle clear 
before actual examples are analysed. Assuming that the data under 
consideration are normally distributed, what is the probability that 
in choosing two occurrences both will be above average or that both 
will be below average, or that there will be one above and one below 
average? In this case the number of occurrences being considered is 
two, while the specified set of conditions is that the value is above 
average. The overall probability of an above-average value is 50% 
or 0.5, as a normal distribution is assumed. Equally, the probability
that a value will not be above average, i.e. will be below average, is
also 0.5. From these data it is now possible to assess the probabilities
sought at the beginning of this example. In a simple case such as this 
it can be done by tabulating all the possible combinations. 


                   First         Second        Third         Fourth
                   possibility   possibility   possibility   possibility
Above average      1, 2          1             2             -
Below average      -             2             1             1, 2


Thus both occurrences could be above average; both could be below 
average; and there are two ways in which one above average and one 
below average value could occur. In other words out of four possible 
combinations, only one could give both occurrences above average, 
i.e. the probability of this happening is 0.25. This is also true for both
values below average, while there is a 0.5 probability of one of each
of the two categories occurring. If this example is now turned from
numbers into symbols the means by which these probabilities are
obtained will be seen. Thus the specified above-average conditions
can be called p, and those occurrences that do not satisfy these
conditions can be called q, the data being retabulated in this form.

                 First poss.    Second poss.   Third poss.    Fourth poss.
Symbols:         p     p        p     q        q     p        q     q
Individual
probability:     0.5   0.5      0.5   0.5      0.5   0.5      0.5   0.5
Overall
probability:     0.25           0.25           0.25           0.25
                 = 0.5 × 0.5    = 0.5 × 0.5    = 0.5 × 0.5    = 0.5 × 0.5
                 = p²           = pq           = qp           = q²
                                (pq + qp = 2pq)


Thus the two probabilities of 0.25 and the one of 0.5 are seen to
result from the multiplication of the individual probabilities, this
‘Multiplication Law’ applying in the case of the simultaneous
occurrence of events as well as for assessing the probability of events in
succession (see pp. 160-161). The essential ‘rightness’ of this process
and of the results is clear in the tabulation. Moreover, the setting
in succession of the terms p², 2pq and q² should recall certain aspects
of simple algebra acquired at the age of twelve or thirteen, for
p² + 2pq + q² is the expansion of (p + q)². In other words, the
probabilities of getting 2 occurrences of p, 1 occurrence of each of
p and q, and 2 occurrences of q are given by the terms of the
expansion of (p + q)². Furthermore, the power to which (p + q) is raised,
i.e. 2, equates with the number of occurrences being considered, i.e.
2, and it can be shown that the same relationship holds true
whatever number of occurrences are being considered. Therefore the
general formula for obtaining the individual terms of the binomial
distribution is written as (p + q)ⁿ, the expansion of this yielding the
successive probabilities from all occurrences of p to all occurrences
of q.
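
The tabulation of possibilities can be repeated mechanically for any
number of occurrences. The sketch below simply lists every combination
of p and q and applies the Multiplication Law to each; the function name
enumerate_outcomes is an illustrative assumption.

```python
from itertools import product
from collections import Counter

def enumerate_outcomes(p, n):
    """Probability of each number of p-occurrences among n, by listing every combination."""
    q = 1 - p
    totals = Counter()
    for combo in product("pq", repeat=n):  # e.g. ('p', 'q'): first above, second below
        prob = 1.0
        for symbol in combo:
            prob *= p if symbol == "p" else q  # Multiplication Law
        totals[combo.count("p")] += prob  # pq and qp are gathered together as 2pq
    return dict(totals)

print(enumerate_outcomes(0.5, 2))
```

For two occurrences with p = 0.5 this reproduces the probabilities 0.25,
0.5 and 0.25 of the tabulation; the number of combinations doubles with
each additional occurrence, which is why the expansion of (p + q)ⁿ is
preferred in practice.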

This is applied in the following way. In a given set of data it is
known that the proportion with characteristic p is 0.2 so that the
proportion without this characteristic, i.e. q, is 0.8. It is required to know
the different probabilities of the various possible combinations of
p and q, if 5 occurrences are being examined. The basic formula
(p + q)ⁿ thus becomes (0.2 + 0.8)⁵, or in its expanded form

$$p^5 + 5p^4q + 10p^3q^2 + 10p^2q^3 + 5pq^4 + q^5$$

Inserting the appropriate numerical values this becomes

$$0.0003 + 0.0064 + 0.0512 + 0.2048 + 0.4096 + 0.3277$$

These are then allocated as follows:

probability of 5 occurrences of p and 0 of q = 0.0003
probability of 4 occurrences of p and 1 of q = 0.0064
probability of 3 occurrences of p and 2 of q = 0.0512
probability of 2 occurrences of p and 3 of q = 0.2048
probability of 1 occurrence  of p and 4 of q = 0.4096
probability of 0 occurrences of p and 5 of q = 0.3277

                           Total probability = 1.0000
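
The terms of the expansion can also be computed directly. The sketch
below is one possible arrangement, not a prescribed procedure; the
function name binomial_terms is an illustrative assumption, and the
coefficients are taken from the standard combination formula.

```python
from math import comb

def binomial_terms(p, n):
    """Terms of (p + q)**n, ordered from n occurrences of p down to none."""
    q = 1 - p
    return [(k, comb(n, k) * p**k * q**(n - k)) for k in range(n, -1, -1)]

terms = binomial_terms(0.2, 5)
for k, prob in terms:
    print(f"{k} of p and {5 - k} of q: {prob:.4f}")
print(f"total: {sum(prob for _, prob in terms):.4f}")
```

Each term is the coefficient multiplied by the appropriate powers of p
and q, and the terms necessarily sum to unity.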


The definition of the various terms applying to the different
frequencies may well raise problems. The first such problem will
probably be to assess the number of times by which the various powers
of p and q must be multiplied. This can be obtained without
calculation by the use of what is known as ‘Pascal’s Triangle’, which is set
out in Table XI. The values in this table can be simply extended
beyond n = 10 by the process of addition. Thus, line n = 4 is obtained
from line n = 3 by adding each successive pair of values in line n = 3
together, i.e. 0 + 1 = 1; 1 + 3 = 4; 3 + 3 = 6; 3 + 1 = 4; 1 + 0
= 1; in this way the coefficients when n = 4 are seen to be 1, 4, 6, 4, 1.
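
The process of addition lends itself to a very short routine. In the
sketch below the function name next_row is an illustrative assumption,
and a nought is placed at each end of a line so that the sums 0 + 1 and
1 + 0 arise as in the worked example.

```python
def next_row(row):
    """Obtain the next line of Pascal's Triangle by adding each successive pair of values."""
    padded = [0] + row + [0]  # a nought at each end, giving 0 + 1 and 1 + 0
    return [a + b for a, b in zip(padded, padded[1:])]

row = [1, 1]  # coefficients for n = 1
for n in range(2, 13):
    row = next_row(row)
    print(n, row)
```

Continuing the loop beyond n = 10 extends the table as far as may be
required.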


Table XI

Pascal’s Triangle

Number in
the sample = n     Coefficients in the expansion of (p + q)ⁿ

       1           1   1
       2           1   2   1
       3           1   3   3   1
       4           1   4   6   4   1
       5           1   5  10  10   5   1
       6           1   6  15  20  15   6   1
       7           1   7  21  35  35  21   7   1
       8           1   8  28  56  70  56  28   8   1
       9           1   9  36  84 126 126  84  36   9   1
      10           1  10  45 120 210 252 210 120  45  10   1


The other possible problem in the use of this technique is to estab- 
lish the powers to which p and q must be raised for the different 




STATISTICAL METHODS AND THE GEOGRAPHER 


terms. Again working from all the occurrences being p to all the 
occurrences being q, i.e. from left to right in Pascal’s Triangle, in the
first case the power of p is equal to n and that of q is nil. The power
for the former steadily decreases by one each time moving from left
to right while that of q equally steadily increases from nil to n in the
same direction. This can therefore be written as follows: 


$$p^n;\ p^{n-1}q;\ p^{n-2}q^2;\ \text{etc. to}\ p^2q^{n-2};\ pq^{n-1};\ q^n.$$

Thus, if there were 8 occurrences, i.e. n = 8, then the terms of the
expansion of (p + q)⁸ would be:

$$p^8 + 8p^7q + 28p^6q^2 + 56p^5q^3 + 70p^4q^4 + 56p^3q^5 + 28p^2q^6 + 8pq^7 + q^8$$

This gives the full range of probabilities from eight occurrences of
the given conditions p to no occurrences of these conditions but eight
occurrences of the reverse conditions q instead.

A series of practical examples will illustrate this method in various 
ways and will also present several of the sorts of geographical prob- 
lems that can be tackled by the use of this method of analysis. Sup- 
pose, for example, that it were known that in a particular area an 
annual rainfall of less than 20” would result in a very poor harvest 
and furthermore that two such years in succession would lead to 
many farmers becoming bankrupt, much land going out of cultiva- 
tion and the danger of famine. An analysis of the rainfall records in- 
dicates that a rainfall of below 20" is likely to occur with a 10% 
probability, i.e. that there is a 10% chance of such a low value occur- 
ring or that on average it is likely to occur 1 year in 10. One such year 
can be survived, albeit with difficulty, and the problem therefore re- 
solves itself into an assessment of the probability of two such years 
occurring in succession. This question can be analysed by means of 
the binomial distribution, for the probability with which given
conditions will occur is known to be 0.1, and the number of occurrences
under consideration is 2. Thus into the formula (p + q)ⁿ can be
entered the values


p=01 i.e. a 10% probability of receiving the given conditions; 


g=09 ie. a90% probability that these given conditions will not 
be received and rainfall will be above 20”: 


n=2 i.e. the probabilities of receiving p and g in 2 successive 
years is required. 
The expansion of these terms can be obtained in the way shown 
earlier and can be set out as follows: 

Conditions                                Calculations                           Probability 
Both years below 20”                      $p^{2} = 0.1^{2}$                      0.01 
One year below 20” and one year above     $2pq = 2 \times 0.1 \times 0.9$        0.18 
Both years above 20”                      $q^{2} = 0.9^{2}$                      0.81 

Total probability                                                                1.00 


Thus it can be seen that with the conditions specified above, in 
which the percentage probability of a year with below 20” rainfall 
was obtained from both the mean and the standard deviation, two 
successive years with this low rainfall will occur with a probability 
of 0.01, i.e. there is a 1% chance of its occurring. Equally it shows 
that out of any pair of years there is an 18% chance that one of them 
will be dry, while there is an 81% probability that both years will be 
above the critical value. Values such as these may be markedly 
different from those which are often assumed from the study of mean 
values alone, or even from the more detailed results of variability 
analysis. In this case it means that conditions leading to famine, 
i.e. two successive dry years, will occur very infrequently 
(technically, once in 101 years) despite the occurrence of dry 
years in 10% of all the years. 
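
The same terms can be evaluated for any p and n. The following is a minimal sketch, assuming independent years and a constant probability p; the function name `binomial_probabilities` is an illustrative choice and not something taken from the text.

```python
from math import comb

def binomial_probabilities(p, n):
    """Probabilities of n, n-1, ..., 0 occurrences of the given condition,
    i.e. the successive terms of (p + q)^n with q = 1 - p."""
    q = 1 - p
    return [comb(n, k) * p ** (n - k) * q ** k for k in range(n + 1)]

# Rainfall figures from the text: p = 0.1 for a dry year, n = 2 years.
both_dry, one_dry, none_dry = binomial_probabilities(0.1, 2)
print(round(both_dry, 2), round(one_dry, 2), round(none_dry, 2))  # 0.01 0.18 0.81
```

The list is ordered in the same way as the expansion, from all occurrences being p to all being q, so that the terms always sum to unity.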

This same method of analysis can, of course, be used in many other 
problems. For example, a given place may have an average long-term 
temperature for its warmest month of 65°F, which may be adequate 
for the maintenance of growth for certain trees. Such a temperature 
may not, however, be warm enough for the fruiting and regeneration 
of such trees, for which a mean temperature for the warmest month 
of 72°F may be required. With a life-span for the trees of about 100 
years, such conditions are only essential at least once a century, to 
ensure replacement as old trees die out. By considering the standard 
deviation of the temperature data it is possible to discover the overall 
frequency with which such warmer conditions occur. If the standard 
deviation were found to be 3°F this would mean that 


$\dfrac{x - \bar{x}}{\sigma} = \dfrac{72 - 65}{3} = \dfrac{7}{3} = 2.33$ 


and from the normal distribution function (Table X) this implies that 
the critical value, i.e. 72°F, is exceeded on 1% of the occurrences. In 
terms of an infinitely long series of data the necessary warmth occurs 
with just the minimum frequency to ensure regeneration. There is no 
guarantee, however, that because the overall percentage probability 
is 1% these conditions will occur with this frequency regularly, 
i.e. once every hundred years. The likelihood of temperatures above 
72°F occurring with given frequencies within a period of a hundred 
years can, however, be assessed by the binomial distribution. In this 
case the components of the formula $(p + q)^{n}$ are: 

p = 0.01, this being the probability of receiving a mean temperature 
for the month above 72°F; 

q = 0.99, this being the probability of not receiving more than that 
amount; 

n = 100, this being the critical period within which it is necessary 
for this temperature to be received. 


What is now wanted is the probability of a monthly temperature 
of the warmest month being over 72°F occurring some time during 
a hundred years. As the total probability of values for differing pro- 
portions of ‘above and below 72°F’ must equal unity, the simplest 
way to obtain the required answer is to calculate the probability that 
no year with a monthly temperature above 72°F will occur within the 
hundred years; subtracting this from unity will give the probability 
value required. The probability that there will be no values of p is 
obtained by calculating $q^{n}$, which is the last of the terms of the 
expansion of $(p + q)^{n}$ (see p. 64). Thus, the probability of no p value 
= $q^{n}$ = $0.99^{100}$ = 0.366. Therefore, the probability of some p values 
= 1 − 0.366 = 0.634. 
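
The two steps of this example, the standardised value and the ‘at least once’ probability, can be checked numerically. The following is a minimal sketch; it assumes that the temperatures are normally distributed and that the years are independent, and the helper `upper_tail` is a hypothetical stand-in for the table of the normal distribution function, built on the standard-library function `math.erfc`.

```python
from math import erfc, sqrt

def upper_tail(z):
    """Probability that a normally distributed value lies more than
    z standard deviations above the mean."""
    return 0.5 * erfc(z / sqrt(2))

z = (72 - 65) / 3          # standardised critical temperature, about 2.33
p_exact = upper_tail(z)    # close to the 1% read from the table
p = 0.01                   # rounded value used in the text
q, n = 1 - p, 100
p_none = q ** n            # no sufficiently warm year in the century
p_some = 1 - p_none        # at least one sufficiently warm year
print(round(z, 2), round(p_none, 3), round(p_some, 3))  # 2.33 0.366 0.634
```

Working from the complement in this way avoids evaluating the other hundred terms of the expansion.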

It can therefore be seen that although there is a more than 60% 
probability that one or more years in a hundred will experience 
temperatures adequate for tree regeneration, there is a 35 to 40% 
probability that not even one year out of the hundred will receive such 
adequate temperatures. It would thus appear that it is quite possible 
for trees to fail to regenerate under these conditions, after possibly 
several centuries of continued existence and regeneration, without 
any real change in climate to account for this change in vegetation. 
The ‘change’ which would have occurred would be no more than the 
random occurrence of exceptionally warm conditions with an overall 
frequency of 1%, this necessarily implying that at times a period of 
more than a hundred years will lapse between such occurrences, 
while at other times these occurrences will be slightly more frequent 
for several centuries. It must not be assumed that the above argument 
proves or disproves changes in climate. As presented here it is simply 
an example of a possible set of relationships, but it does indicate the 
type of problem that may well repay analysis by this method. 

A final example may reinforce the understanding of these methods. 
Suppose, for example, that in a given area it is known that 60% of 
the farms include dairying within their economy. In a brief visit, 
perhaps on a field excursion, it proves possible to visit three farms 
within this area. What are the probabilities of these visits including 
3, 2, 1 or even 0 farms with dairying activities? The components p, 
q and n can be set out as before: 


p (the proportion with dairying) = 0.6 
q (the proportion without dairying) = 0.4 
n (the number of farms being visited) = 3 


The proportions are as follows, still following the working principles 
set out on p. 64. 


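The probability of 3 farms with dairying = $p^{3} = 0.6^{3}$ = 0.216 
The probability of 2 farms with dairying = $3p^{2}q = 3 \times 0.6^{2} \times 0.4$ = 0.432 
The probability of 1 farm with dairying = $3pq^{2} = 3 \times 0.6 \times 0.4^{2}$ = 0.288 
The probability of 0 farms with dairying = $q^{3} = 0.4^{3}$ = 0.064 

Total probability = 1.000 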


So the possibility of the visited farms reflecting the overall balance 
of 60% with dairying, i.e. approximately 2 farms out of 3 with dairy- 
ing, is less than 50%, for there is a more than 20% probability that 
all the 3 farms will include dairying, and more than a 30% chance 
that no more than one of the 3 farms will include dairying. Figures 
such as these are a salutary warning against basing general conclu- 
sions on a too limited study and this whole theme of the size of the 
sample for study and the degree of accuracy that this provides must 
be taken up at greater length in Chapter 6. The above example will 
then be considered again in more detail. 

Probability and the Poisson Frequency Distribution 


In all these examples of assessing the probability with which given 
conditions occur in a specific number of occurrences the data have 
always been such that they could be divided into those occurrences 
when the given conditions did occur and those when they did not. 
Probability values on an overall basis could thus be ascribed to both 
sets of conditions, under the terms p and q. In some cases, however, 
data are concerned with isolated events in time when although it is 
possible to specify the number of times certain conditions did occur 
it is not possible or not sensible to say how often they did not. For 
example, it is possible to consider the number of times that hail falls 
or fog occurs in a month, or the number of times that a river floods 
in a winter or a wet season, but seeking to know how many times 
these events did not occur is neither sensible nor feasible. 

In such studies as these the data are always discrete, i.e. whole 
numbers, the frequency distribution is usually skew and there is a 
limit to the possibilities in one direction because of zero values and 
perhaps in the other because of magnitude. The question that nor- 
mally requires solution here is the probability with which different 
numbers of these occurrences are likely to occur. Thus it may 
be desired to know the probability of a particular river flooding 
0, 1, 2, 3, 4 or 5 times in a wet season. Here the limiting factor of 
zero values is clearly important, while again it is unlikely that the 
values could continue increasing indefinitely. It would, of course, be 
possible to assess these probabilities by calculating the average and 
standard deviation values, obtaining overall probabilities from the 
normal distribution function and then calculating probabilities from 
the binomial distribution, as has been done with the examples worked 
out above. With a set of data which is markedly skew, however, the 
probabilities from the normal distribution function would be of only 
generalized reliability, and therefore a probability distribution which 
closely approximates to a skew distribution would provide a better 
estimate of probability. 

For example, consider the data set out below concerning the num- 
ber of times a river floods in a wet season. Clearly the frequency will 
differ from one year to another, and the number of years in which 
0, 1, 2, 3, 4 or 5 floods occurred during a period of 100 years is given 
in the following table. 

No. of years    No. of floods 
24              0 
35              1 
24              2 
12              3 
4               4 
1               5 


With the total number of 140 floods during the 100 years, the average 
number of floods per year is 1.4. Calculation will also show that the 
standard deviation for these data is 1.15. By applying the formula 
$\dfrac{x - \bar{x}}{\sigma}$ and referring to the table of the normal distribution 
function (Table X) for given values of x, it is found that the estimated 
probabilities by this method overestimate the frequency of years with 
many floods and underestimate the frequency of years with few 
floods. Any estimate by the binomial distribution using these values 
for p and q will therefore equally be too divorced from reality to be 
of real value. 
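
The mean and standard deviation quoted for these data can be obtained directly from the frequency table. The following is a minimal sketch; the name `frequency_summary` is illustrative only, and the divisor is the total number of years, with no small-sample correction applied.

```python
from math import sqrt

# Frequency table: number of floods in a wet season -> number of years
# (100 years and 140 floods in all).
years_by_floods = {0: 24, 1: 35, 2: 24, 3: 12, 4: 4, 5: 1}

def frequency_summary(table):
    """Return (mean, variance, standard deviation) of a frequency table,
    dividing by the total count."""
    total = sum(table.values())
    mean = sum(x * f for x, f in table.items()) / total
    variance = sum(f * (x - mean) ** 2 for x, f in table.items()) / total
    return mean, variance, sqrt(variance)

mean, variance, sd = frequency_summary(years_by_floods)
print(round(mean, 2), round(variance, 2), round(sd, 2))  # 1.4 1.32 1.15
```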

To be able to postulate probability values in such a case it is neces- 
sary to use a third technique, this being based on the Poisson distribu- 
tion. This distribution utilizes the mathematical constant that is 
written as e, which is derived from the exponential law of natural 
growth (see Chapter 13). Its method of calculation need not be considered 
here but its value, correct to four decimal places, is 2.7183. 
This is used in a series of successive terms which express the probability 
of 0, 1, 2, 3, 4 etc. events occurring. These terms are as follows: 


$e^{-z}$; $z\,e^{-z}$; $\dfrac{z^{2}}{2!}\,e^{-z}$; $\dfrac{z^{3}}{3!}\,e^{-z}$; $\dfrac{z^{4}}{4!}\,e^{-z}$; etc. 


In these terms 
e is the value 2.7183 indicated above; 
z is the average value for the set of data; 
$e^{-z}$ is the same as writing $\dfrac{1}{e^{z}}$; 
! indicates that it is the ‘factorial’ of the number concerned, 
thus 3! = 3 × 2 × 1 = 6, 
while 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720. 


By calculating the values for these terms it is possible to evaluate the 
probabilities of 0, 1, 2 etc. events occurring, without first calculating 
the standard deviation or making any other prior assessments. One 
thing is required, however, namely that the average or expected num- 
ber, i.e. z, should be constant (or virtually so) from trial to trial, i.e. 
from one set of years to another. 

This formula can be applied to the present data as follows: 


$e^{-z} = \dfrac{1}{2.7183^{1.4}}$ = 0.2466 = probability of 0 floods 

$z\,e^{-z}$ = 1.4 × 0.2466 = 0.3452 = probability of 1 flood 

$\dfrac{z^{2}}{2!}\,e^{-z} = \dfrac{1.4^{2}}{2}$ × 0.2466 = 0.98 × 0.2466 = 0.2417 = probability of 2 floods 

$\dfrac{z^{3}}{3!}\,e^{-z} = \dfrac{1.4^{3}}{6}$ × 0.2466 = 0.458 × 0.2466 = 0.1127 = probability of 3 floods 

$\dfrac{z^{4}}{4!}\,e^{-z} = \dfrac{1.4^{4}}{24}$ × 0.2466 = 0.1602 × 0.2466 = 0.0395 = probability of 4 floods 

$\dfrac{z^{5}}{5!}\,e^{-z} = \dfrac{1.4^{5}}{120}$ × 0.2466 = 0.0449 × 0.2466 = 0.0110 = probability of 5 floods 

0.9967 = approximate total probability 


To indicate the extent to which this method does provide a valid 
index of the probability with which these events occur, the events 
themselves, the probability values, the frequency which this implies 
over a hundred years, and the actual values presented earlier are all 
tabulated below. 


Number of floods per wet season                  0        1        2        3        4        5 
Probability value                                0.2466   0.3452   0.2417   0.1127   0.0395   0.0110 
Probable frequency per hundred years             25       35       24       11       4        1 
Actual frequency in the specified hundred years  24       35       24       12       4        1 


Thus a very close approximation to the actual conditions was pro- 
vided by the Poisson distribution when applied to this sort of data. 
It may also have been observed that, the standard deviation being 
1.15, the variance was therefore 1.32. This variance is almost the same 
as the mean value of 1.4, and this coincidence of the average and the 
variance is the hall-mark of data which fit the Poisson distribution. 
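
The successive Poisson terms need not each be worked out from scratch, since every term is the previous one multiplied by z divided by the number of events. The following is a minimal sketch of that recurrence; the name `poisson_terms` is illustrative, and `math.exp` is used in place of the rounded constant 2.7183, so that the last decimal place may differ slightly from the hand calculation.

```python
from math import exp

def poisson_terms(z, max_events):
    """Probabilities of 0, 1, ..., max_events events for a Poisson mean z.
    Each term is the previous one multiplied by z / k."""
    term = exp(-z)
    terms = [term]
    for k in range(1, max_events + 1):
        term *= z / k
        terms.append(term)
    return terms

probabilities = poisson_terms(1.4, 5)
expected_years = [round(100 * p) for p in probabilities]
print(expected_years)  # [25, 35, 24, 11, 4, 1]
```

Multiplying each probability by the hundred years of record gives the probable frequencies against which the actual frequencies can be compared.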

Apart from such a study of isolated events in time it is also possible 
to analyse in this way isolated events in space or distance. For example, 
it may be desirable, when studying the impact of transport 
facilities on industrial location, to consider the relative frequency 
with which industrial premises occur in close proximity to railway 
stations. This could then perhaps be compared to the frequency with 
which such premises occur near port facilities and trunk road junc- 
tions. Such problems of comparison will be considered later in 
Chapters 8-10. In the case of the railway stations, a count could be 
made to discover how many industrial premises occur near each of 
a series of sample stations; the method of choosing the stations to 
be studied will be outlined in Chapter 7. For now, assume that the 
following figures were obtained. 


No. of stations    No. of industrial premises near that station 
182                0 
91                 1 
23                 2 
3                  3 
1                  4 


In this example there are 300 stations and the average number of 
industrial premises per station is 0.5. Further calculation will show 
that the variance of this set of data is 0.503. As this is virtually the 
same as the mean it is therefore possible to use the Poisson distribution 
to make an assessment of the probability with which given numbers 
of premises will occur near each station. The normal curve and 
the binomial distribution could not be used in this case, for with a 
standard deviation of 0.71 any probabilities obtained in that way 
would underestimate the occurrence of few premises and overestimate 
the frequency of many premises. The Poisson distribution 
values can be obtained as follows: 
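
                                                                                    No. of occurrences 
Term                          Value                                        Probability   Calculated   Observed 
$e^{-z}$                      $2.7183^{-0.5}$                              0.6065        181.98       182 
$z\,e^{-z}$                   $0.5 \times 2.7183^{-0.5}$                   0.3032        90.97        91 
$\dfrac{z^{2}}{2!}\,e^{-z}$   $\dfrac{0.5^{2}}{2} \times 2.7183^{-0.5}$    0.0758        22.75        23 
$\dfrac{z^{3}}{3!}\,e^{-z}$   $\dfrac{0.5^{3}}{6} \times 2.7183^{-0.5}$    0.0126        3.79         3 
$\dfrac{z^{4}}{4!}\,e^{-z}$   $\dfrac{0.5^{4}}{24} \times 2.7183^{-0.5}$   0.0016        0.47         1 
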
The close relationship of these values to the actual ones is clear. In 
most cases, of course, the relationship is nowhere near as marked, 
the variance not being as close to the mean as it was in this case. 
When the variance differs too greatly from the mean it is still pos- 
sible to use the Poisson distribution by means of an adjustment to 
the formula, but this can be followed up by the reader in more 
advanced texts if he so requires. 
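
The whole procedure for the railway station data, from the comparison of mean and variance to the calculated frequencies, can be sketched as follows. This is a minimal illustration and not a definitive implementation; the name `poisson_fit` is hypothetical, and the mean is simply estimated from the observed table.

```python
from math import exp, factorial

# Observed table: premises near a station -> number of stations.
observed = {0: 182, 1: 91, 2: 23, 3: 3, 4: 1}

def poisson_fit(table):
    """Return (mean, variance, expected counts) for a Poisson model whose
    mean is estimated from the frequency table itself."""
    total = sum(table.values())
    mean = sum(x * f for x, f in table.items()) / total
    variance = sum(f * (x - mean) ** 2 for x, f in table.items()) / total
    expected = {x: total * mean ** x * exp(-mean) / factorial(x) for x in table}
    return mean, variance, expected

mean, variance, expected = poisson_fit(observed)
print(mean, round(variance, 3))  # 0.5 0.503
for x in observed:
    print(x, observed[x], round(expected[x], 2))
```

A variance that departs markedly from the mean in the first line of output would be the signal that the unadjusted Poisson terms are unlikely to fit well.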

Throughout this and the previous chapter the average, standard 
deviation and variance values, the methods of calculating which were 
outlined in Chapters 2 and 3, have been put to some practical use 
beyond the simple representation of the basic parameters of a set of 
data. Especially they have been employed in the assessment of the 
probability with which given conditions may be expected to occur. 
In order to do this it has been shown to be necessary to allocate the 
data to one of several distribution curves, the one chosen being partly 
conditioned by the character of the data and partly by the problem 
that it is desired to solve. In all cases, however, the aim of assessing 
probabilities has been to obtain from a limited set of data informa- 
tion of what is likely to occur throughout a much larger—in fact, an 
infinitely larger—set of data. This limited set of data is what is known 
as a ‘sample’ of the larger body of data. As it is so useful to be able 
to obtain an assessment about conditions in a large body of data by 
analysing a relatively small body, and also as it is often the case that 
only a ‘sample’ of conditions is in fact available, it is therefore essen- 
tial that the characteristics and limitations of working on sample data 
be understood. That is the purpose of Chapters 6 and 7. 

CHARACTERISTICS OF SAMPLES 


Sample and Population Parameters 


In most of the methods so far outlined there is the implied assump- 
tion that the values obtained, especially in terms of mean and devia- 
tion, apply to an infinitely long series of data. This long series of data 
is referred to as the population, and the parameters obtained are thus, 
for example, the population mean and the population standard devia- 
tion. More concisely, at times these may be called the true mean etc., 
this term thus implying that it is the value which would be obtained 
from analysing the whole body of data concerning the phenomenon 
under study. The values that in practice are obtained are usually 
based on only part of the body of data, this being the result either of 
a deliberate choice or because no more data are available, i.e. these 
values are based on only a sample of the conditions. Thus what is 
usually obtained is not the true or population mean but the sample 
mean; the same applies to the standard deviation too. Before pro- 
ceeding to any assessment of the differences between different series 
of data, or to any further conclusions based on the mean and the 
standard deviation, it is therefore essential that some thought be 
given to the relationship between these sample parameters and the 
true parameters. 

The relationship that may be expected to hold true between sample 
and population parameters is partly conditioned by the size of the 
sample and partly by the method of obtaining the sample. Ideally the 
choice of sample would be purely random, i.e. without any bias what- 
soever, and simply as a chance selection of so many items out of the 
body of data. The means by which a random choice may be made 
will be outlined in Chapter 7; suffice it to say at this stage that such 
a sample should give as true and representative a cross-section of the 
population as is permitted by the size of the sample. In many cases, 
however, especially in geographical analyses, such a random selec- 
tion is either not possible or not desirable for other reasons. The 
general concepts on which sampling techniques are based are 
nevertheless best explained in terms of random sampling, and the 
modifications necessitated by non-random samples can then be 
presented afterwards. 

Given that the sample is a random one, the major factor control- 
ling the relationship between sample and population values is thus 
the size of the sample. The influence of this can probably best be seen 
if a slight digression be made to consider again the frequency dis- 
tribution curve of a normal distribution. In Fig. 21 the curve for the 
individual items of a set of data (i.e. when n = 1) is the lowest and 
most broadly based of those which are shown, while the average ob- 
tained from these individual items is shown as being centrally placed. 
Suppose, however, that instead of considering individual items, these 
data were first grouped arbitrarily into groups of 3 items each, i.e. 
that samples of 3 items each were obtained by random sampling, 
and that the average were to be obtained for each of these samples 
of 3 items. It would then be possible to plot a distribution curve 
for these ‘means of 3 items’ and to obtain an overall average value. 
This average would be the same as that for the individual items, 
but the curve would differ. When taking the samples it would be 
unlikely that in all cases all three items in the sample would lie 
on the same side of the average. 
With the averaging of these 3 items the likely range of values of 
‘means of 3 items’ would be less than that for individual items, so 
that the values would cluster more closely around the average. 
So although the average of the ‘3 item sample’ data would be the 
same as that of the individual items, its variance and standard 
deviation would be less. This is shown diagrammatically in Fig. 21 
by the second lowest of the curves. This lesser degree of scatter of 
sample means than of individual values around the average applies 
no matter what size of the sample is taken. On the other hand, the 
greater the number of items in the sample means, the smaller will be 
the scatter of these sample means, as is shown in Fig. 21 by the topmost 
curve for ‘10 item samples’. This indicates that the variance of 
these distribution curves based on sample means is related to the 
number of items in the sample. This relationship is expressed as 
follows: 

variance of sample means with n items per sample = $\dfrac{\text{variance of individual items}}{\text{number of items per sample}}$ 

or more briefly: $\mathrm{var}_{n} = \dfrac{\sigma^{2}}{n}$ 

Figure 21. Distribution curves of sample means of n items 


Furthermore, as the standard deviation is the square root of the 
variance the standard deviation of sample means with n items per 
sample can be obtained as follows: 


$\sigma_{n} = \sqrt{\mathrm{var}_{n}} = \sqrt{\dfrac{\sigma^{2}}{n}} = \dfrac{\sigma}{\sqrt{n}}$ 

i.e. the standard deviation of a distribution of sample averages is ob- 
tained by dividing the standard deviation of the individual items by 
the square root of the number of items in the sample. 
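
This narrowing of the curves with increasing sample size can be demonstrated by drawing random samples. The following simulation is a sketch only: the population mean of 50 and standard deviation of 10 are arbitrary assumptions made for the illustration, as are the number of samples and the fixed seed.

```python
import random
from math import sqrt
from statistics import pstdev

random.seed(1)        # fixed seed so that the run is repeatable
SIGMA = 10.0          # assumed standard deviation of the individual items
N_SAMPLES = 20000     # number of samples drawn for each sample size

def sd_of_sample_means(n):
    """Standard deviation of the means of N_SAMPLES random samples of n items."""
    means = [sum(random.gauss(50.0, SIGMA) for _ in range(n)) / n
             for _ in range(N_SAMPLES)]
    return pstdev(means)

# Columns: items per sample, simulated scatter, theoretical sigma / sqrt(n).
for n in (1, 3, 10):
    print(n, round(sd_of_sample_means(n), 2), round(SIGMA / sqrt(n), 2))
```

The simulated values will not agree exactly with the theoretical ones, since they are themselves based on a finite number of samples, but they should lie within a few per cent of them.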


Sampling or Standard Error 


The greatest value of this relationship to sampling procedure lies 
in a corollary from the above argument. If the distribution curve for 
the ‘means of samples of 10 items’ is considered separately in Fig. 21 
it is seen that it is a normal curve symmetrical about an average value 
which is the same as the average value for the overall data given by 
the individual items. It can therefore be argued that, because of the 
characteristics of the normal distribution, it is extremely improbable 
that any one ‘mean of a sample of 10 items’ will differ from this overall 
average by more than 3 standard deviations, i.e. by more than 
$3\left(\dfrac{\sigma}{\sqrt{10}}\right)$, and that it is unlikely that it will differ from this overall 
average by more than 2 standard deviations, i.e. by more than 
$2\left(\dfrac{\sigma}{\sqrt{10}}\right)$. If this is so, the reverse argument can also be applied, 
namely that if any given ‘mean of a sample of 10 items’ is known then 
the overall or true mean is unlikely to differ from this sample mean 
by more than $2\left(\dfrac{\sigma}{\sqrt{10}}\right)$, and it is extremely improbable that it will differ 
from this sample mean by more than $3\left(\dfrac{\sigma}{\sqrt{10}}\right)$. Thus, if a sample 
mean is obtained it is possible to indicate the limits within which the 
true mean must lie with a given percentage probability, i.e. 

the true mean $\bar{X}$ = the sample mean $\bar{x} \pm 2\left(\dfrac{\sigma}{\sqrt{n}}\right)$ with a 95.45% probability; 

or = the sample mean $\bar{x} \pm 3\left(\dfrac{\sigma}{\sqrt{n}}\right)$ with a 99.7% probability. 

In most cases the true mean will lie closer to the sample mean than 
these values, for these only indicate the limits beyond which it is 
unlikely that the true mean will lie. 

An example of this sort of application will help to stress its implications. 
Suppose that a study is being made of farming over a large 
area and an assessment is required of the average size of farm holdings. 
The total number of farms is so large that it is decided to study 
only a sample of these farms. Provided that this sample is truly random, 
picked in a way that will be outlined in Chapter 7, it would be 
possible to assess the limits within which the true mean should fall 
with a known percentage probability. The accuracy of this or of any 
sample is, as indicated above, related to the size of the sample, and 
thus not to the percentage of the total data which is included in the 
sample. Given that the variance of the sample mean is expressed by 
$\dfrac{\sigma^{2}}{n}$ it is clearly the magnitude of n which is important, whether this 
be 90% or 9% of the total of occurrences. In the present example it 
could be that a sample of 200 farms is to be taken. From these it is 
found that the sample average acreage is 90 acres and that the sample 
standard deviation is 7 acres. The calculation of the limits of the 
true mean is thus as follows: 


no. of items (n) = 200; sample mean ($\bar{x}$) = 90; 

sample standard deviation (indicated by $s$ rather than $\sigma$) = 7; 

true mean = $\bar{X}$. 

Thus, $\bar{X} = \bar{x} \pm \dfrac{s}{\sqrt{n}}$ with a confidence limit, i.e. a percentage 
probability of being correct, of c. 68% 

$= 90 \pm \dfrac{7}{\sqrt{200}} = 90 \pm 0.5$ (actually 0.496) 

i.e. $\bar{X}$ lies between 89.5 and 90.5 with a c. 68% probability. 

Again, $\bar{X} = \bar{x} \pm 2 \cdot \dfrac{s}{\sqrt{n}}$ with a confidence limit of c. 95% 

$= 90 \pm (2 \times 0.5) = 90 \pm 1.0$ 

i.e. $\bar{X}$ lies between 89.0 and 91.0 with a c. 95% probability. 

Further, $\bar{X} = \bar{x} \pm 3 \cdot \dfrac{s}{\sqrt{n}}$ with a confidence limit of 99.7% 

$= 90 \pm (3 \times 0.5) = 90 \pm 1.5$ 

i.e. $\bar{X}$ lies between 88.5 and 91.5 with a 99.7% probability. 


Thus it can be seen that limits can be set to the true mean value, and 
that these limits are wider the more stringent are the probability 
values. This value which controls these limits, i.e. $\dfrac{s}{\sqrt{n}}$, is known in 
this connection as the Standard Error of the Mean. 
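
The three sets of limits can be produced by one short routine. The following is a minimal sketch; the name `mean_limits` is illustrative, and the multipliers 1, 2 and 3 correspond to the approximate 68%, 95% and 99.7% probability levels of the normal curve. Direct evaluation of 7/√200 gives roughly 0.495, which rounds to the 0.5 used in the worked figures.

```python
from math import sqrt

def mean_limits(sample_mean, s, n, multiplier):
    """Limits sample_mean +/- multiplier x s / sqrt(n) for the true mean."""
    standard_error = s / sqrt(n)
    half_width = multiplier * standard_error
    return sample_mean - half_width, sample_mean + half_width

# Farm example from the text: n = 200, mean 90 acres, s = 7 acres.
for multiplier, level in ((1, "c. 68%"), (2, "c. 95%"), (3, "99.7%")):
    low, high = mean_limits(90, 7, 200, multiplier)
    print(level, round(low, 1), round(high, 1))
```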

Although this does provide an estimate of the limits of the true 
mean, it equally stresses the limitations implicit in a sample mean 
even when it is based on a sample as large as 200. If a sample ten 
times as large were taken, i.e. if n = 2,000, it would be found that 
the standard error of the mean (S.E. $\bar{x}$) equals 0.157 acres instead of 
the value of 0.5 acres based on 200 items. Thus by a sample ten 
times as large the ‘error’ is reduced to about a third of its size, and 
the limits of the true mean could then be set as being between 89.53 
and 90.47 with a 99.7% probability. It can here be seen that to alter 
the probability limits for these values from approximately 68% to 
99.7% requires a tenfold increase in the size of the sample (and the 
work associated with it!). Much of the art of sampling lies in choosing 
the smallest size of sample that will give an answer with the desired 
degree of accuracy and probability. However, 
if a certain degree of accuracy is required it must necessarily mean 
a certain sized sample; there is no satisfactory way of getting an 
adequate answer with an inadequately sized sample. 
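
Because the standard error varies with the square root of n, the size of sample needed for a given standard error follows by rearranging the expression to n = (s / S.E.)². The following is a minimal sketch of that rearrangement; the function names and the target error of 0.25 acres are assumptions made for the illustration.

```python
from math import ceil, sqrt

def standard_error(s, n):
    """Standard error of the mean, s / sqrt(n)."""
    return s / sqrt(n)

def sample_size_for(s, target_error):
    """Smallest n whose standard error s / sqrt(n) does not exceed target_error."""
    return ceil((s / target_error) ** 2)

print(round(standard_error(7, 200), 3))    # 0.495
print(round(standard_error(7, 2000), 3))   # 0.157
print(sample_size_for(7, 0.25))            # 784
```

Halving the standard error from about 0.5 to 0.25 acres thus requires roughly four times as many farms, which is the same square-root relationship that makes a tenfold sample reduce the error only to about a third.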

A comparable sort of standard error can also be obtained for the 
standard deviation. This Standard Error of the Standard Deviation is 
obtained by the expression $\dfrac{s}{\sqrt{2n}}$, from which the degree of uncertainty 
inherent in the estimate of the standard deviation from a 
sample can be obtained. So in the farm acreage example above the 
true standard deviation can be assumed to lie within the following 
limits with the following degrees of probability: 

true standard deviation $\sigma$ = sample standard deviation $s \pm \dfrac{s}{\sqrt{2n}}$ 
with a 68% probability 

i.e. $\sigma = 7 \pm \dfrac{7}{\sqrt{2 \times 200}} = 7 \pm 0.35$ 

i.e. the true standard deviation lies between 6.65 and 7.35 with a 68% 
probability. 

Similarly it would be found that the true standard deviation lies 
between 6.3 and 7.7 with a c. 95% probability and between 5.95 and 
8.05 with a 99.7% probability. Again, if the sample were to be increased 
to 2,000 items then the true standard deviation would be seen 
to lie between 6.67 and 7.33 with a 99.7% probability. The accuracy 
of these statements can be readily checked by the reader by calculating 
the standard error of the standard deviation on the basis of 
2,000 items, a sample mean of 90 and a sample standard deviation 
of 7. 
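
The check suggested here can be carried out in a few lines. The following is a minimal sketch; the name `sd_limits` is illustrative only.

```python
from math import sqrt

def sd_limits(s, n, multiplier):
    """Limits s +/- multiplier x s / sqrt(2n) for the true standard deviation."""
    half_width = multiplier * s / sqrt(2 * n)
    return s - half_width, s + half_width

print([round(v, 2) for v in sd_limits(7, 200, 1)])    # [6.65, 7.35]
print([round(v, 2) for v in sd_limits(7, 200, 3)])    # [5.95, 8.05]
print([round(v, 2) for v in sd_limits(7, 2000, 3)])   # [6.67, 7.33]
```

The sample mean of 90 plays no part in this calculation, for the standard error of the standard deviation depends only on s and n.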


Best Estimates, Small Samples and Small Populations 


In all these calculations of standard errors which have so far been 
presented one assumption has been made which is not strictly justified. 
From the diagram in Fig. 21, the mean value and the standard 
deviation value which led to the expression that $\sigma_{n} = \dfrac{\sigma}{\sqrt{n}}$ (p. 75) 
were the mean and standard deviation of the total population. In the 
above samples, however, it is the mean and standard deviation of 
only the one sample which is used. This is often done through sheer 
necessity, for only the sample data may be available. Nevertheless, in 
order to be able to apply the method of obtaining the standard error 
with some justification, an estimate should be made of the true standard 
deviation. This process is usually referred to as making a best 
estimate, and it is done by applying a correction to the sample standard 
deviation. This correction, which is sometimes called Bessel’s 
Correction, is $\sqrt{\dfrac{n}{n-1}}$ for changing the sample standard deviation 
to the best estimate of the standard deviation, and it is $\dfrac{n}{n-1}$ for 
changing the sample variance to the best estimate of the variance. 
There are thus three possible values for which the term standard 
deviation is used, and each has its own symbol. There is the sample 
standard deviation ($s$), the true or population standard deviation ($\sigma$), 
and the best estimate of the standard deviation ($\hat{\sigma}$); such a circumflex 
over a sign always indicates a ‘best estimate’. 

It is possible to apply this correction to the values used in the 
previous example. The sample standard deviation in that case was 
7. This must therefore be multiplied by $\sqrt{\dfrac{n}{n-1}}$: 

$\hat{\sigma} = s \cdot \sqrt{\dfrac{n}{n-1}} = 7 \times \sqrt{\dfrac{200}{199}} = 7.0175$ 

This best estimate of 7.0175 can therefore be inserted in the calculation 
of the standard error of the mean, which becomes 

$\dfrac{\hat{\sigma}}{\sqrt{n}} = \dfrac{7.0175}{\sqrt{200}}$ 

The difference between this and the value of 0.496 on p. 77 is negligible, 
because of the size of the sample. It is clear that the larger the 
sample the closer will the correction $\sqrt{\dfrac{n}{n-1}}$ approximate to unity, 
while if the sample is small the value of $\sqrt{\dfrac{n}{n-1}}$ will be considerably 
above unity and will therefore markedly affect the size of the standard 
error. This is but one of the problems associated with small samples, 
which will be examined further in later pages. 

This extra calculation of the best estimate of the standard deviation 
can in fact be avoided by integrating the correction factor $\sqrt{\dfrac{n}{n-1}}$ 
into the standard deviation formula. The correctness of this integration 
can more clearly be seen if it is first effected for the variance 
rather than the standard deviation. So, if the sample variance is $s^{2}$ 
the conversion to the best estimate of the variance ($\hat{\sigma}^{2}$) is made as 
follows: 

$\hat{\sigma}^{2} = s^{2} \times \dfrac{n}{n-1} = \dfrac{\sum (x - \bar{x})^{2}}{n} \times \dfrac{n}{n-1} = \dfrac{\sum (x - \bar{x})^{2}}{n-1}$ 

As the standard deviation is the square root of the variance it follows 
that the best estimate of the standard deviation may be obtained from 
a sample by direct calculation from the formula 

$\hat{\sigma} = \sqrt{\dfrac{\sum (x - \bar{x})^{2}}{n-1}}$ 

Thus the calculation in the above example would be 

$\hat{\sigma} = \sqrt{\dfrac{9800}{199}} = \sqrt{49.246} = 7.0175$ 

This gives the same answer as by the application of the correction 
after calculating the sample standard deviation (p. 79). As this 
difference between the sample and best estimate values may well be of 
significance at times, it is always desirable, when using a set of data 
as a sample of a larger body of data, either to insert $n - 1$ for $n$ 
in the standard deviation calculation or to apply the correction 
$\sqrt{n/(n-1)}$ afterwards. 
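
The two routes described above can be compared in a short sketch. The function names and the six readings are illustrative assumptions, not taken from the text; only the figures s = 7 and n = 200 come from the worked example.

```python
import math

def sample_sd(values):
    """Sample standard deviation s: divisor n."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def best_estimate_sd(values):
    """Best estimate of the population standard deviation: divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def correct_sd(s, n):
    """Apply the correction sqrt(n / (n - 1)) to an already calculated s."""
    return s * math.sqrt(n / (n - 1))

# Figures from the text: s = 7 from a sample of 200 items.
print(round(correct_sd(7, 200), 3))

# Hypothetical readings, used only to show that both routes agree.
readings = [12.0, 15.5, 9.0, 14.0, 11.5, 13.0]
print(math.isclose(best_estimate_sd(readings),
                   correct_sd(sample_sd(readings), len(readings))))
```

Either route should give the same value apart from rounding, so the choice between them is one of convenience.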

Although the application of this correction helps to counterbalance 
any underestimate of conditions introduced by a sample which is not 
very large, there is the need when samples are really small for a further 
modification to be made, this time to the actual use of the standard 
error. In small samples it is no longer safe or justified to assume that, 
for example, values will lie within two standard deviations of the 
mean with a 95% probability. In other words, the probability values 
of the normal curve cannot be assumed to apply to the sample even 
though the full body of the data fits the normal curve. Instead use 
should be made of Student’s t distribution. This will be considered 
more fully in Chapter 8. For now it is sufficient to refer to the graph 
in Fig. 27. For this it is necessary first to obtain the value (n − 1), 
which is here known as the ‘degrees of freedom’ (see p. 125), and then 
to read off against this the ‘t’ value for the required probability level. 
Thus if on the normal curve a 95% probability of values lying within 
the two standard deviation limits would be used, then ‘t’ is read off 
at the 5% level on Fig. 27. The value thus obtained, which will be 
somewhat larger than 2, is then used in the true mean calculations 
instead of the value of 2 itself, when multiplying the standard error 
value. So, whereas with a large sample the limits of the true mean 
($\bar{X}$), defined with a 95% probability, would be obtained from 

$$\bar{X} = \bar{x} \pm 2\,\frac{\hat{\sigma}}{\sqrt{n}}$$

with a small sample the formula would become 

$$\bar{X} = \bar{x} \pm t\,\frac{\hat{\sigma}}{\sqrt{n}}$$

The same would be true when assessing the true standard deviation, 
which with small samples would thus be 

$$\sigma = s \pm t\,\frac{\hat{\sigma}}{\sqrt{2n}}$$

The differences which these modifications of the formula may in- 
troduce into assessments of the limits of the true mean and standard 
deviation can most readily be appreciated if one set of sample data 
is analysed by the several methods outlined above and the resulting 
assessments are then compared. As a practical example, assume that 
a study is being made of the number of people in a series of parishes 
or communes over a large area. The total number of units is con- 
siderable, but some reasonable degree of similarity in population size 
etc. can be expected on the basis of prior knowledge of the area. It 
is therefore decided to make a rapid sample analysis of values before 
making a full study, so that any obvious problems can be foreseen. 
For this purpose a sample is chosen at random (see p. 90), totalling 
as few as only 10 communes. From this sample the following para- 
meters are calculated: 


number of items ($n$) = 10 communes 
sample average ($\bar{x}$) = 350 people per commune 
sample standard deviation ($s$) = 15 people 


From these it would be possible to calculate the limits of the true 
mean with a 95% probability of being right by the formula 

$$\bar{X} \text{ (with a 95\% probability)} = \bar{x} \pm 2\,\frac{s}{\sqrt{n}} = 350 \pm \frac{2 \times 15}{\sqrt{10}}$$

$$= 350 \pm \frac{30}{3.16} = 350 \pm 9.5 = 340.5 \text{ to } 359.5$$


This, however, fails to take into account the fact that only the sample 
standard deviation is being used and that the best estimate of this 
parameter should in fact be employed. This is therefore obtained as 
follows: 


best estimate of standard deviation ($\hat{\sigma}$) 

$$= s\sqrt{\frac{n}{n-1}} = 15 \times \sqrt{\frac{10}{9}} = 15 \times \sqrt{1.11}$$

$$= 15 \times 1.055 = 15.825$$


With $\hat{\sigma}$ inserted for $s$ the assessment of the limits of 
the true mean becomes 

$$\bar{X} \text{ (with a 95\% probability)} = \bar{x} \pm 2\,\frac{\hat{\sigma}}{\sqrt{n}} = 350 \pm \frac{2 \times 15.825}{3.16}$$

$$= 350 \pm 10 = 340 \text{ to } 360$$


Such an assessment is strictly only applicable if the sample is a large 
one, but in this case it is small (this term frequently being taken to 
imply 10 items or less, although the methods are often applied to 
rather larger samples too, to be on the safe side). Therefore the 
frequency values of the normal distribution should be replaced by those 
of the Student’s t distribution. Referring to Fig. 27, it is first 
necessary to obtain what are called the ‘degrees of freedom’, i.e. n − 1, 
which in this case is 10 − 1 = 9. The value for t for 9 degrees of 
freedom is then read off at the 5% level, this being virtually the 
equivalent of the 2 standard deviation probability on the normal curve. 
This gives a value for t of 2.4 at this 5% level. It is this value which 
must now replace the 2 in the formula. Thus, bearing in mind the fact 
that this is only a small sample, plus the need to correct in terms of 
the best estimate of the standard deviation, the limits of the true 
mean, with a 95% probability of being right, are: 

$$\bar{X} \text{ (with a 95\% probability)} = \bar{x} \pm t\,\frac{\hat{\sigma}}{\sqrt{n}} = 350 \pm \frac{2.4 \times 15.825}{3.16}$$

$$= 350 \pm 12 = 338 \text{ to } 362$$


In this way it can be seen that the limits of the true mean are in 
fact wider than might be assumed at first, and it is the latter set of 
values which should be used. In terms of the present example it means 
that by considering only ten communes, and assuming that these are 
representative of the whole data, the overall average (i.e. true mean) 
population per commune or parish can be assessed within reasonable 
limits, i.e. it will almost certainly lie between 338 and 362 persons. 
Such an assessment can well provide a useful indication of the order 
of magnitude within which working will take place, and also of the 
order of detail that may be required to enable significant differences 
to be appreciated. Furthermore, this example also indicates that the 
closeness with which sample values will approximate to true values 
is controlled by several variables. The difference between sample and 
true values will increase as the stringency of the percentage prob- 
ability of being right is increased, as the standard deviation increases 
and as the size of the sample decreases. As the second of these vari- 
ables is inherent in the body of the data, it is only in the first and the 
last that there is some element of conscious choice. This choice is 
exercised in terms of the character of the analysis, its purpose, and 
the degree of accuracy required. 
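
The three assessments of the commune example can be set side by side in a minimal sketch. The function name is an assumption made for illustration. The multiplier 2.4 is the value the text reads from Fig. 27; a printed table of t gives roughly 2.26 for 9 degrees of freedom at the 5% level, so the limits obtained depend on how the graph is read.

```python
import math

def limits_of_true_mean(mean, sd, n, multiplier=2.0):
    """Return (lower, upper) = mean -/+ multiplier * sd / sqrt(n)."""
    margin = multiplier * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Sample values from the text.
n, mean, s = 10, 350.0, 15.0
sigma_hat = s * math.sqrt(n / (n - 1))  # best estimate of the standard deviation

with_s = limits_of_true_mean(mean, s, n)                  # sample s, multiplier 2
with_sigma_hat = limits_of_true_mean(mean, sigma_hat, n)  # best estimate, multiplier 2
with_t = limits_of_true_mean(mean, sigma_hat, n, multiplier=2.4)  # t as read from Fig. 27

for label, (low, high) in [("sample s", with_s),
                           ("best estimate", with_sigma_hat),
                           ("best estimate and t", with_t)]:
    print(f"{label:20s} {low:6.1f} to {high:6.1f}")
```

Each refinement widens the limits, which is the progression the worked example follows.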

Before considering how a decision can best be made concerning 
the most suitable size of sample, one further theme must be outlined. 
Suppose that when sampling it was found that the size of the total 
population was very small, in contrast to the earlier examples where 
it was the sample size that was small. In such a case the best estimate 
of the standard deviation would be virtually the true value, i.e. 
$\hat{\sigma} = \sigma$. Therefore the standard error would be less 
than the usual formula would suggest, and the standard error calculated 
in the normal way must be modified by a factor related to the 
proportion of the population forming the sample. This proportion is 
known as the ‘sampling fraction’. The factor used is $\sqrt{1 - f}$, 
where $f$ is the sampling fraction. This means that if all the 
population were to be included in the sample, then the sampling 
fraction $f$ would be unity, the correction factor would be 0 and 
therefore the standard error would also be 0. 

With this factor added, the standard error of the mean for a 
random sample of a small total population is 


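$$\text{S.E.}_{\bar{x}} = \frac{\hat{\sigma}}{\sqrt{n}}\sqrt{1-f} \quad\text{or}\quad \hat{\sigma}\sqrt{\frac{1-f}{n}} \quad\text{or}\quad \sqrt{\frac{\hat{\sigma}^2}{n}(1-f)}$$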


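So if it were found from a small total population that 
$\hat{\sigma} = 40$, that the number of items in the sample was 4 
(i.e. $n = 4$) and that the sampling fraction $f = \frac{1}{5}$, the 
standard error would not be 

$$\frac{\hat{\sigma}}{\sqrt{n}} = \frac{40}{\sqrt{4}} = 20$$

but 

$$\frac{\hat{\sigma}}{\sqrt{n}}\sqrt{1-f} = 20 \times \sqrt{1 - 0.2} = 20 \times \sqrt{0.8} = 20 \times 0.9 = 18.0 \text{ (approximately)}$$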


In this way the standard error is reduced for the same size of sample, 
but only if the total population is itself not large. 
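
As a rough sketch of this adjustment, the standard error and its correction for the sampling fraction can be computed as follows. The function name is an assumption; the figures are those of the example above.

```python
import math

def standard_error_small_population(sigma_hat, n, sampling_fraction):
    """Standard error of the mean, reduced by the factor sqrt(1 - f)."""
    if not 0 <= sampling_fraction <= 1:
        raise ValueError("the sampling fraction must lie between 0 and 1")
    return sigma_hat / math.sqrt(n) * math.sqrt(1 - sampling_fraction)

# Figures from the text: sigma_hat = 40, n = 4, f = 1/5.
uncorrected = 40 / math.sqrt(4)
corrected = standard_error_small_population(40, 4, 0.2)
print(uncorrected, round(corrected, 1))

# If the whole population is sampled (f = 1) the standard error vanishes.
print(standard_error_small_population(40, 4, 1.0))
```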


Specification of Sample Size 


It has been indicated earlier that it is often of very great value to 
be able to judge the minimum size of sample that will ensure that the 
true mean is obtained to within given limits. For example, in the case 
outlined on pp. 81-83 it is considered desirable to establish the true 
mean’s limits with a probability of 95%. Also, from a small sample 
such as ten items the best estimate of the standard deviation has been 
calculated as 15.825. The range within which the true mean lies was 
too wide if only ten items were included in the sample, and it is 
decided that to be able to make any useful general judgments from 
a sample the true mean needs to be defined to within ±5 of the 
sample mean (at the 95% probability level). The question therefore 
is what size of sample needs to be taken to give this degree of accuracy 
under these conditions; i.e. assuming on the evidence of the 10 item 
sample that for the required degree of accuracy the sample will not 
be a small one, what size of sample will yield a standard error of 2.5? 
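The formula for the standard error is thus 

$$\text{S.E.} = \frac{\hat{\sigma}}{\sqrt{n}}$$

and this must equal a desired value ($d$). So 

$$\frac{\hat{\sigma}}{\sqrt{n}} = d, \qquad \frac{1}{\sqrt{n}} = \frac{d}{\hat{\sigma}}, \qquad \sqrt{n} = \frac{\hat{\sigma}}{d}, \qquad n = \left(\frac{\hat{\sigma}}{d}\right)^2$$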

In the present example, when $\hat{\sigma} = 15.825$ and $d = 2.5$, 

$$n = \left(\frac{15.825}{2.5}\right)^2 = 6.33^2 = 40 \text{ items in the sample.}$$


As a check to show that a sample of that size would give the desired 
result, the following calculation can be made in terms of $n = 40$. 

$$\bar{X} \text{ (at 95\% probability)} = \bar{x} \pm 2\,\frac{\hat{\sigma}}{\sqrt{n}} = 350 \pm \frac{2 \times 15.825}{\sqrt{40}} = 350 \pm \frac{31.65}{6.33} = 350 \pm 5$$

$$= 345 \text{ to } 355 \text{ persons}$$


This formula for calculating the size of the sample required for given 
conditions can always be applied to data based on random sampling, 
when the population is virtually normal in distribution and when 
some best estimate of the standard deviation has been made. 
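
The sample-size formula lends itself to a very small sketch. The function name is an assumption, and the result is left unrounded so that the reader can decide how to round it; the text rounds 40.07 to 40.

```python
def required_sample_size(sigma_hat, desired_se):
    """Sample size n = (sigma_hat / d) ** 2 giving a standard error of d."""
    if desired_se <= 0:
        raise ValueError("the desired standard error must be positive")
    return (sigma_hat / desired_se) ** 2

# Figures from the text: sigma_hat = 15.825, d = 2.5.
n_needed = required_sample_size(15.825, 2.5)
print(round(n_needed, 2))

# Halving the desired standard error quadruples the sample required.
print(round(required_sample_size(15.825, 1.25) / n_needed, 2))
```

Because the standard error enters as a square, narrowing the limits is costly in terms of sample size.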


Standard Error and Sample Size with the 
Binomial Frequency Distribution 


Of the assumptions mentioned above in connection with these 
calculations of sample size, the one that must be stressed is that of 
the data approximating to the normal distribution. This may often 
not be the case, however, when probabilities are to be calculated by 
the binomial distribution based on a fairly small sample. On p. 67 
an example of this sort was used. In an area where the overall 
percentage probability of farms engaging in dairying was 60% a small 
sample of only 3 farms was visited. The resulting probabilities of 3, 
2, 1 or 0 of these three farms including dairying in their activities 
were set out (p. 67). The frequency distribution for this is somewhat 
skew, as is shown diagrammatically in Fig. 22. If a larger sample had 
been taken then the distribution curve would have been less skew, 
as is shown for a sample of 10 farms also in Fig. 22. This partial 
correction of skewness would have been greater still if some 30 or 
40 farms had been included in the sample. In this particular case the 
values of p and q were 0.6 and 0.4 respectively, and in such cases an 
almost normal curve can be obtained with a relatively small sample. 
If these values were 0.1 and 0.9 instead, then a far larger sample would 
be needed to give a near-normal curve. 

Figure 22. Effect of size of sample on the skewness of a binomial distribution (vertical axis: probability of p occurring a given number of times) 

In all such binomial distributions, however, the calculations of the 
standard error of the sample mean, or the assessment of the size of 
the sample required, must be effected by slightly different methods 
from those outlined above. A suitable example of this can be provided 
by outlining a problem of assessing the proportion of an area which 
is under irrigation, without having to account for and study every 
acre. The sample data will be in the form of a certain proportion of 
irrigated land and a certain proportion of non-irrigated land, these 
two proportions together giving the total size of the sample. Thus the 
sample data are characterized by the expression $(p + q)^n$, where p is 
the proportion of land that is irrigated, q the proportion that is not 
irrigated and n the number of items in the sample. The average 
frequency of the given conditions, i.e. irrigated land, given by this 
sample may be assumed to be 30%, so that the probability is 0.3. 
Conversely, 70% of the sample area must therefore be non-irrigated, 
its probability of occurrence being 0.7. These are the p and q values 
in the equation. The relationship of the true proportions to these 
sample proportions, however, will depend on the size of the sample, 
which will affect the standard error of the sample value. 

With the normal distribution this standard error is expressed as 
$\hat{\sigma}/\sqrt{n}$. This is replaced, in the case of the binomial 
distribution, by $\sqrt{npq}$, which expresses the standard error in 
absolute terms in relation to the number of items in the sample. The 
values in a binomial distribution are most readily expressed, however, 
as a proportion or as a percentage. To obtain this the standard error 
given above can be multiplied by $100/n$, so that as a general 
statement the percentage standard error is obtained by 

$$\sqrt{npq} \times \frac{100}{n}$$

If the two component parts of this are each squared (thus giving the 
variance) it can be written as 

$$npq \times \frac{100^2}{n^2}$$

With a little cancellation, and the reintroduction of the square root 
to give the standard error again, this becomes 
$\sqrt{pq \cdot 100^2 / n}$. The term $pq \cdot 100^2$ could equally be 
written $100p \cdot 100q$, which is the same as $p\% \cdot q\%$. Thus 
the formula for the standard error of the sample proportion, expressed 
as a percentage, is simply 

$$\text{S.E.}\% = \sqrt{\frac{p\% \cdot q\%}{n}}$$


In terms of the example specified earlier, the following values will 
obtain for the percentage standard error of the sample proportion of 
irrigated land. If the sample value of 30% were based on a sample 
of 50 items, then 

$$\text{S.E.}\% = \sqrt{\frac{p\% \cdot q\%}{n}} = \sqrt{\frac{30 \times 70}{50}} = \sqrt{\frac{2100}{50}} = \sqrt{42} = 6.5\% \text{ (approximately)}$$

i.e. at the 95% level of probability, the true percentage of the whole 
area that is irrigated would be 

$$30\% \pm 2(6.5)\% = 30 \pm 13 = 17\% \text{ to } 43\%$$

If, on the other hand, the sample had consisted of 300 items, then 

$$\text{S.E.}\% = \sqrt{\frac{p\% \cdot q\%}{n}} = \sqrt{\frac{30 \times 70}{300}} = \sqrt{\frac{2100}{300}} = \sqrt{7} = 2.65\%$$

Thus, in this case the true percentage of the land that was irrigated 
would lie, with a 95% probability, between the following limits: 

$$30\% \pm 2(2.65)\% = 30 \pm 5.3 = 24.7\% \text{ to } 35.3\%$$

a more restricted range because of the larger sample. 
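
The effect of sample size on the percentage standard error can be checked with a brief sketch. The function names are assumptions for illustration; the inputs are the 30% irrigated proportion and the two sample sizes used above.

```python
import math

def percentage_standard_error(p_percent, n):
    """S.E.% = sqrt(p% * q% / n), where q% = 100 - p%."""
    q_percent = 100 - p_percent
    return math.sqrt(p_percent * q_percent / n)

def limits_of_true_percentage(p_percent, n, multiplier=2.0):
    """Limits of the true percentage at roughly the 95% level (multiplier 2)."""
    margin = multiplier * percentage_standard_error(p_percent, n)
    return p_percent - margin, p_percent + margin

for n in (50, 300):
    se = percentage_standard_error(30, n)
    low, high = limits_of_true_percentage(30, n)
    print(f"n = {n:3d}: S.E. = {se:.2f}%, limits {low:.1f}% to {high:.1f}%")
```

The unrounded results differ slightly from the printed figures, which use the standard error rounded to 6.5% and 2.65%.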

Finally in terms of the random sampling of a binomial distribution 
in this way, it is often of considerable value, after an initial sample 
has been made, to assess the size of sample required to yield a stan- 
dard error of a given magnitude. This has already been outlined for 
the normal distribution on p. 85 and can be calculated here as 
follows: 


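$$\text{S.E.}\% = \sqrt{\frac{p\% \cdot q\%}{n}} = d \quad \text{(where } d \text{ is the desired value for the standard error)}$$

So 

$$\frac{p\% \cdot q\%}{n} = d^2 \qquad \text{and therefore} \qquad n = \frac{p\% \cdot q\%}{d^2}$$
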
If the desired value for the standard error is set at 2%, i.e. $d = 2$, 
then the necessary sample size in the irrigation example is 

$$n = \frac{p\% \cdot q\%}{d^2} = \frac{30 \times 70}{2^2} = \frac{2100}{4} = 525 \text{ (sample size)}$$

On the other hand, if a standard error as large as 5% is adequate the 
sample size can be much smaller, i.e. 

$$n = \frac{p\% \cdot q\%}{d^2} = \frac{30 \times 70}{5^2} = \frac{2100}{25} = 84 \text{ (sample size)}$$
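
The same rearrangement can be sketched for any preliminary percentage and desired standard error. The function name is an assumption; the inputs repeat the irrigation example.

```python
def binomial_sample_size(p_percent, desired_se_percent):
    """Sample size n = p% * q% / d**2 for a desired percentage standard error d."""
    if desired_se_percent <= 0:
        raise ValueError("the desired standard error must be positive")
    q_percent = 100 - p_percent
    return p_percent * q_percent / desired_se_percent ** 2

# Irrigation example: p% = 30, with d = 2 and d = 5.
print(binomial_sample_size(30, 2))
print(binomial_sample_size(30, 5))
```

Note that the size of the total population does not appear in the formula; only the preliminary proportion and the desired standard error are needed.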


The size of sample required to give a standard error of 2%, and 
therefore an estimate of the true proportion at the 95% probability 
level to within ±4%, may seem rather large at 525. This sample 
size, however, will apply to any size of total population, i.e. in this 
case to any size of area. If a study is being made of irrigated land 
on a large scale, 525 samples is a small price to pay for an estimate 
of the overall percentage value to within these close limits. 


CHAPTER 7 


METHODS OF SAMPLING 


Methods of Random Sampling 


All these considerations so far made concerning sampling have been 
based on the assumption that the sample itself has been a random 
sample, implying, as was stated earlier (p. 73), that the sample is an 
unbiased and representative cross-section of the body of data. The 
means by which such a random sample is obtained have not so far 
been considered, however. Suppose that a long list of data is avail- 
able, perhaps concerning administrative units or industrial premises 
or climatic conditions, and it is desired to make a sample study of 
these data. This may be either because it is not considered worth while 
to analyse the full set or because a preliminary survey of this sort may 
enable the full study to be made more effectively. If a limited number 
of items were picked because they were considered ‘typical’, or be- 
cause they stood out more clearly than the others, or because they 
were places known to (or near to) the person concerned, then there 
would be no justification for assuming that the conditions in these 
cases would represent the full range of conditions in the whole body 
of data. The sample would be what is termed ‘biased’, i.e. weighted 
in a given direction because of the way in which it was chosen. This 
must be very carefully guarded against, for if a choice of sample is 
made in this or a similar way the resulting values of mean and stan- 
dard deviation conditions, and of related probability and other char- 
acteristics, will apply only to the sample data themselves. There will 
be no really adequate method of assessing the relationship between 
these sample characteristics and those of the population from which 
they were drawn, i.e. the concept of the ‘standard error’ which has 
been outlined above cannot be legitimately applied. 

The choice of the sample should instead be made by reference to 
a table of Random Sampling Numbers, a short example of which, 
extracted from the Cambridge Elementary Statistical Tables, is 
presented in Table XII. Thus if the data consisted of 100 items, listed 
in order of magnitude or in some other way, the first two columns 
of digits in Table XII would be used with the numbers 00 representing 
100. If a sample of ten items were to be picked then numbers 20, 74, 
94 etc. to 04, 32 on that list would form the sample. Again, if the full 
list were made up of almost 10,000 items then the first four columns 
would be used, again the 0000 representing 10,000. Perhaps in this 
case a sample of 100 items would be decided upon. The first of these 
would then be number 2,017 on the full list, the next would be number 
7,449, the next number 9,470 until 100 items had been chosen. In this 
way no bias would be introduced into the choice, for, to quote from 
the source for the numbers in Table XII, ‘Each digit is an independent 
sample from a population in which the digits 0 to 9 are equally 
likely, that is, each has a probability of 1/10.’ Also, provided that the 
sample is not so small that it cannot incorporate the full range of 
conditions in the population, a choice such as this should provide a 
balanced cross-section of the population conditions, unless, of 
course, there are really extreme conditions which occur very 
infrequently. If this is known or found to occur then a rather different 
method of choosing a sample must be used, as will be outlined below. 


Table XII 
Random Sampling Numbers 

20 17  42 28  23 17  59 66  38 61  02 10  86 10  51 55  92 52 
74 49  04 49  03 04  10 33  53 70  11 54  48 63  94 60  94 49 
94 70  49 31  38 67  23 42  29 65  40 88  78 71  37 18  48 64 
22 15  78 15  69 84  32 52  32 54  15 12  54 02  01 37  38 37 
93 29  12 18  27 30  30 55  91 87  50 57  58 51  49 36  12 53 

45 04  77 97  36 14  99 45  52 95  69 85  03 83  51 87  85 56 

This table is extracted from the first part of Table 8: Random Sampling Numbers, 
in D. V. Lindley and J. C. P. Miller, Cambridge Elementary Statistical Tables, 
Cambridge, 1953. 

In the examples just considered the total population came to the 
same number as the possibilities involved in the number of digits. It 
is more often the case that this is not so. For example, the population 
may total 2,000 items, and therefore four digits must be used for the 
random numbers. When this happens there are two possible ways of 
using the random numbers. One method is simply to accept the ran- 
dom numbers which are obtained up to 2,000 and reject (i.e. ignore) 
those numbers which are obtained between 2,001 and 9,999, carrying 
on with this until the sample of 100 items is obtained between 1 and 
2,000. This can be quite a lengthy process, a very high rejection rate 
being likely in this example. Instead it is possible to rephrase the 
numbers above 2,000 as repeats of the 1 to 2,000 series, i.e. numbers 
2,001 to 4,000; 4,001 to 6,000 etc. can each be taken as a fresh series 
of values of 1 to 2,000. Thus all the numbers are used and much time 
is saved. Another occasion when the renumbering of data is con- 
venient is if the data are available in a series of groups yet it is desired 
to obtain an overall sample rather than a sample of each group. For 
example, data may be available concerning the numbers of inhabi- 
tants in a large number of settlements. One group of these settlements 
may be small villages and are returned as such. Another group also 
returned separately may consist of large villages, another of small 
towns, and yet another of larger towns. Although it is possible to 
consider such data in a different way, as will be outlined below, it 
is also possible to take one sample at random from the whole of the 
settlements together. To use the table of random numbers in such a 
case, the numbers must be made to run consecutively through the 
whole population. So if the first group consists of 255 settlements 
these can be numbered 1 to 255; if the second group contains 176 
items these should be renumbered 256 to 431; if the third group con- 
sists of 87 values then these should be renumbered 432 to 518; while 
the fourth group, totalling 18 values, would become 519 to 536. A 
random sample can then be obtained in the way outlined above. 
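
The two ways of using the random numbers, and the renumbering of grouped data, can be sketched as below. This is only an illustration under stated assumptions: Python's pseudo-random generator stands in for the printed table, repeated items are skipped (the text does not say how repeats should be treated), and when the population does not divide evenly into the range of the digits the incomplete final series is rejected so that every item keeps an equal chance. All function names are hypothetical.

```python
import random

def draw_number(digits, rng):
    """One random number of the given number of digits; all zeros stands for 10**digits."""
    r = rng.randrange(10 ** digits)
    return r if r != 0 else 10 ** digits

def sample_from_list(population_size, sample_size, rng, reuse_series=False):
    """Pick item numbers 1..population_size by rejection or by reusing higher series."""
    if sample_size > population_size:
        raise ValueError("the sample cannot exceed the population")
    digits = len(str(population_size - 1))
    span = 10 ** digits
    # With reuse, numbers above the population are read as fresh series of 1..N.
    usable = (span // population_size) * population_size if reuse_series else population_size
    chosen = []
    while len(chosen) < sample_size:
        r = draw_number(digits, rng)
        if r > usable:
            continue  # reject numbers beyond the usable range
        item = (r - 1) % population_size + 1
        if item not in chosen:  # assumption: an item is not taken twice
            chosen.append(item)
    return chosen

def renumber_groups(group_sizes):
    """Consecutive (first, last) numbers for each group in a combined list."""
    ranges, start = [], 1
    for size in group_sizes:
        ranges.append((start, start + size - 1))
        start += size
    return ranges

rng = random.Random(7)  # fixed seed so that the run can be repeated
sample = sample_from_list(2000, 100, rng, reuse_series=True)
print(sorted(sample)[:5])
print(renumber_groups([255, 176, 87, 18]))
```

With a population of 2,000 the reuse method accepts every four-digit number, whereas simple rejection would discard about four numbers in every five.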

Apart from this selection from data set out in list form, random 
sampling methods and techniques can also be applied to data which 
have an areal distribution. In many geographical problems the draw- 
ing of samples from within data distributed in space is an essential 
part of the analysis of the characteristics and qualities of those data. 
It is often in studies of this sort that there is a great temptation to 
select the items which are to form the sample. For example, in a study 
of agriculture, ‘type’ farms are selected for more detailed study be- 
cause they are known or assumed to represent certain characteristics, 
or because they are farms for which extra or more accurate informa- 
tion is available. Although this may well give a clear picture of a par- 
ticular farm, it does not allow generalizations to be made about farm- 
ing in the area as a whole except by subjective extrapolation. With 
an experienced and highly qualified research-worker this may be done 
with a very high degree of accuracy and validity. Any errors that are 
introduced, however, may be obscured by the treatment, while every 
worker in the field could very easily arrive at a different answer from 
every other one as a result of differences inherent in the approach 
adopted. 

Areal sampling by random numbers requires that first of all the 
area under study should be ‘gridded’. In many cases such a grid is 
already available from the base maps for the area. The grid, whether 
already on the map or added afterwards, can then be numbered as 
is the National Grid on the Ordnance Survey maps of Great Britain, 
i.e. from west to east and from south to north, so that numbers in 
both directions are at zero in the south-west corner of the area, and 
increase steadily eastwards and northwards. These numbers can 
either be made to apply to a grid line or to the space between two 
grid lines. Which is chosen depends on whether the aim is to sample 
various points or various small areas. For example, if it is desired to 
choose a series of farms for study a ‘point sampling’ would be 
necessary. If there were no more than 100 grid lines in each direction 
then the first four columns of digits in the table of random sampling 
numbers could be used (or, to yield a finer net, 6-figure groups could 
be used in the same way as 6-figure grid references). If only four 
values are used, the first two of these would give the ‘easting’, i.e. 
the number of grid lines east of the ‘point of origin’ in the south-west 
corner, while the second two would give the ‘northing’ from this point. 
The point thus arrived at will then designate the farm to be included 
in the sample (Fig. 23a). The farm can be specified either as the one in 
whose land this point lies, or the one whose farmhouse lies nearest to 
this point. The former will tend to give an over-representation of the 
number of large farms, but a true representation of the amount of land 
held by large farms; the latter will tend to over-represent the small 
farm in terms of the amount of land that falls under small farms, but to 
give a true representation in terms of the number of small farms. In 
such a case some ‘stratification’ of the sample (to be discussed 
below, p. 99) may be desirable, but the principle remains the same. 
If instead of farms it is land use for which the sampling is being 
carried out, then a method choosing a series of small areas might be 
preferred. In this case the numbering of the grid could apply to the 
space between the grid lines (Fig. 23b(i)). Again the table of random 
sampling numbers would be used and a sample of small areas, within 
which land use could be plotted rapidly, would be provided. Another 
means of choosing the areas would be to keep the numbering to the 
grid lines, and to choose the square to, say, the north-east of the 
sample point as the sample area (Fig. 23b(ii)). Yet a further possibility 
of areal sampling is by line samples (Fig. 23c), this proving invaluable 
for use with binomial distributions such as the irrigation problem 
considered on pp. 86-89. In this, only eastings or northings are needed, 
and the grid line from this point forms the sample item. Along this 
line the distances possessing or not possessing given characteristics, 
e.g. irrigation, are measured, these values providing the sample data. 

Figure 23. Methods of random sampling for an areal distribution: (a) Point Sampling; (b) Area Sampling, (i) and (ii); (c) Linear Sampling 

In all these cases it has been implicitly assumed that the overall 
area is rectangular in shape and that the grid system therefore can 
fit it exactly. Often this is not so, especially if the overall area is 
some administrative unit. Even so, a rectangular grid should still 
be used, ensuring that it provides a full cover for the area. Then if 
any of the co-ordinates provided by the random numbers lie outside 
the area under study they should be rejected, as was done with those 
numbers which fell beyond the limits of listed data (p. 92). Also, of 
the three basic methods indicated in Fig. 23, and described in the 
foregoing pages, the sample based on areas (Fig. 23b) clearly gives 
a larger sample, but it involves much more work in plotting and cal- 
culation. The point sample (Fig. 23a) gives a relatively thin sample, 
but for the work involved the returns are high. As for line sampling 
(Fig. 23c), the labour, though more than for point sampling, is never- 
theless easy, and it gives a coverage much closer to that obtained by 
sampling small areas. 
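
Point sampling over an area that does not fill its rectangular grid might be sketched as follows. The shape of the study area, the grid size and all names are hypothetical, and a pseudo-random generator again stands in for the printed table; co-ordinates falling outside the area are simply rejected, as the text recommends.

```python
import random

def random_grid_points(count, n_east, n_north, inside, rng):
    """Pick `count` random grid intersections that fall inside the study area."""
    points = []
    while len(points) < count:
        easting = rng.randrange(n_east)    # grid lines east of the origin
        northing = rng.randrange(n_north)  # grid lines north of the origin
        if inside(easting, northing):      # reject co-ordinates outside the area
            points.append((easting, northing))
    return points

def in_study_area(easting, northing):
    """Hypothetical area: the triangle below the diagonal of a 100 x 100 grid."""
    return northing <= easting

rng = random.Random(23)  # fixed seed so that the run can be repeated
points = random_grid_points(20, 100, 100, in_study_area, rng)
print(points[:5])
```

For area sampling the same co-ordinates could be read as the south-west corner of a grid square rather than as a point.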

By these various methods, which differ but little from each other, 
a sample that is strictly random can be obtained from any popula- 
tion, whether this be in the form of a list or of an areal distribution. 
The purpose of such sampling may simply be to choose certain units 
for study, these then being described and explained. This, however, 
is largely a waste of the techniques of sampling, for the data provided 
by the sample allow further conclusions to be drawn concerning the 
whole population. The mean and standard deviation of the sample 
can be obtained in the ways outlined earlier, and from these the 
sampling standard error can be calculated. Due allowance must here 
be made for the size of the sample or for the size of the population, 
again in the ways that have already been explained. In other words, 
the methods that have previously been considered in Chapter 6 are 
directly applicable to sample data obtained by the random sampling 
methods just outlined. 

A specific example of this may prove of help. Suppose that it were 
desired to estimate the relative balance of various types of land-use 
over the part of Britain represented on the L.U.S. Sheet 13, Kirkby 
Stephen and Appleby. This could well be done by a spot sample, as 
is indicated in Fig. 24. In this example a sample of 100 units was 
taken, each being obtained by a 6-figure grid reference picked from 
a table of random numbers. Fig. 24 both locates the sample points 
and shows the land-use (arable, grassland, woodland or moorland) 
at these points when the land-use survey was made. Also, the overall 
distribution of moorland is shown, with which the random sample 
can be compared. From the sample points the following frequencies 
were obtained which, as the sample was of 100 units, also represent 
percentages. 

Arable    Grassland    Woodland    Moorland    Total 
   8          31           6           55        100 
Furthermore, by using the formula on p. 87, 
$\sqrt{p\% \cdot q\% / n}$, the standard error for each of these 
estimates can be calculated. Thus for arable the S.E. 

$$= \sqrt{\frac{8 \times 92}{100}} = \sqrt{\frac{736}{100}} = 2.7\%$$

so that the limits of the true percentage of the area under arable (with 
a 95% probability of being correct) are 

$$8\% \pm 2\ \text{S.E.} = 8 \pm 2(2.7) = 8 \pm 5.4 = 2.6\% \text{ to } 13.4\%$$

The standard errors for the other three types of land-use are: 

             S.E.                                                   Limits of true percentage (at 95% probability) 
grassland    $\sqrt{31 \times 69 / 100} = \sqrt{21.4} = 4.63\%$     31 ± 9.26 = 21.74% to 40.26% 
woodland     $\sqrt{6 \times 94 / 100} = \sqrt{5.64} = 2.4\%$       6 ± 4.8 = 1.2% to 10.8% 
moorland     $\sqrt{55 \times 45 / 100} = \sqrt{24.75} = 4.97\%$    55 ± 9.94 = 45.06% to 64.94% 

Figure 24. Random point sampling of land-use over L.U.S. Sheet 13, Kirkby Stephen and Appleby 

It would, of course, also be possible to translate these percentages 
into absolute values for the area under review. As the overall area is 
532 sq. miles, it could be said that the following areal limits apply at 
the 95% probability level: 

arable       14 to 71 sq. mls.       (42.5 sq. mls. from sample) 
grassland    116 to 214 sq. mls.     (165 sq. mls. from sample) 
woodland     6.5 to 57.5 sq. mls.    (32 sq. mls. from sample) 
moorland     240 to 345 sq. mls.     (292.5 sq. mls. from sample) 
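
The whole land-use assessment can be gathered into one short sketch. The function and variable names are assumptions; the counts, the sample size of 100 and the area of 532 sq. miles are those given above. Since the code works with unrounded standard errors, its results may differ from the printed figures in the last digit.

```python
import math

SAMPLE_SIZE = 100
TOTAL_AREA = 532  # sq. miles
# Counts out of 100 sample points, so they are also percentages.
counts = {"arable": 8, "grassland": 31, "woodland": 6, "moorland": 55}

def land_use_limits(p_percent, n, total_area, multiplier=2.0):
    """Return S.E.%, percentage limits and the same limits as areas."""
    se = math.sqrt(p_percent * (100 - p_percent) / n)
    low, high = p_percent - multiplier * se, p_percent + multiplier * se
    return se, (low, high), (low * total_area / 100, high * total_area / 100)

results = {name: land_use_limits(p, SAMPLE_SIZE, TOTAL_AREA)
           for name, p in counts.items()}

for name, (se, pct, area) in results.items():
    print(f"{name:10s} S.E. {se:5.2f}%  {pct[0]:6.2f}% to {pct[1]:6.2f}%  "
          f"{area[0]:6.1f} to {area[1]:6.1f} sq. miles")
```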


In this and similar ways it would be possible to assess the land-use, 
agricultural economy, population, industrial development or any 
other feature of each of a series of administrative units, the values 
being required for later comparison (see Chapters 8-10). Each unit, 
be it parish or county, could be studied by means of a random sample, 
either from a list of data or from areal distributions. The sample 
means and sample standard deviations thus obtained can then be 
used as being representative of the whole unit, once they have been 
duly modified by the standard error or multiples thereof. 

True random sampling of this sort is frequently possible in geo- 
graphical problems, but equally there are many occasions when this 
is not so. The most common reason for this is that all too often only 
part of the population data is available. For example, rainfall records 
may only exist for some 30 years; historical data concerning medieval 
land-use may be only partially extant; data on industrial production 
or trade may be partially unavailable for security or business reasons. 
In such cases the total available data is but a sample of the total 
population, and furthermore it is, at least in part, a biased sample. 
So in the above examples, the rainfall data are biased in favour of 
one particular period, usually the recent past; the preservation of 
records may itself reflect some aspect of land tenure which encouraged 
maintenance of records and this land tenure may in turn control the 
land-use; the industrial or trade data which are unobtainable may 
fall into this category just because they are so important, the available 
data referring to markedly less important aspects. There are thus 
severe limitations in employing such data as samples from which 
characteristics of the total population can be assessed. Whether or 
not this can be legitimately done can often only be decided in the 
light of other information. For example, if it is known that no significant
climatic change has taken place over a prolonged period of
time, then the 30 years of records can well be employed as a random
sample and assessments based on them accordingly. Again, if it is 
known that the historical data are all for one homogeneous type of 
land tenure and that the records are for areas distributed fairly uni- 
formly over the total area, then it may be reasonable to use these 
records as a sample (almost random in character) of land-use under 
a given land-tenure system. In this connection, whether or not these 
records are distributed reasonably in relation to quality of land, for 
example, could first be tested by the χ² Test, which will be presented
in Chapter 10. In all these cases, of course, it would be quite legitimate 
to sample from the available data, provided always that any con- 
clusions are only related to these available data and not to the total 
population (unless other evidence, as suggested above, also exists). 


Methods of Stratified Sampling 


At times, however, it may be more valuable to analyse a body of 
data in a rather more complex form. In the example on p. 93 to 
illustrate the consecutive numbering of items for random sampling, 
the data were stated to fall into several groups. These data concerned 
settlements that were grouped according to whether they were small 
villages, large villages, small towns or large towns. In such a case it 
may be desirable to assess mean conditions etc. not only for the over- 
all body of data but also for the individual groups separately. Such 
a grouping of the data, with a sample picked from each group, gives 
rise to a stratified sample, and each group that is sampled is referred 
to as a stratum, i.e. the data, and also the sample, are divided into 
layers or strata. The analysis of such a stratified sample proceeds in 
the same way as in the examples outlined earlier, only in this case 
each stratum is sampled by a random sample. To begin, it can be 
assumed that the proportion of each stratum forming the sample is 
the same in each case, i.e. that there is a uniform ‘sampling fraction’. 
Random sampling of the total population will yield a close approxi- 
mation to this, for a random sample tends to select a number of items 
in each stratum proportional to the size of that stratum. The data can 
be set out in the following way, and it could consist of any one of a 
variety of aspects of these settlements. Here it can be simply the 
number of garages serving the settlements, and the sampling fraction 
(f) may be taken as 1/10.
The number of units in the sample, i.e. the number of settlements
studied, is 10% of the total number of units for each stratum. These
values are taken to the nearest whole number upwards, to ensure that 
at least 10% are included. As a result, the estimates of the number of 
units differ from the known number, and the working that follows 
is related to these estimates—in many studies actual numbers are, in 
fact, not known. The number of units in the sample is shown in 
column (b). The number of garages serving each settlement is obtained,
and the total of such garages for each stratum sample is
entered in column (c). The sample mean for each stratum is then
obtained by column (c) / column (b) and entered in column (d). These values
allow an estimate to be made of the total number of units in each
stratum (although in the present case this is already known), and also
an estimate of the total number of garages in each stratum. These
values are entered in columns (e) and (g) respectively.


Table XIII


Tabulation for stratified random sampling with uniform sampling
fraction


                        No. of units,                Sample
                        i.e. settle-    Sample       mean of        Total no.    Estimated
                        ments, in       total of     garages per    of units     total of
Strata                  sample          garages      unit           in strata    garages
(a)                     (b)             (c)          (d)            (e)          (g)
                                                     c/b            b/f          e.d
(i) small villages      26              39           1.5            260          390
(ii) large villages     18              36           2.0            180          360
(iii) small towns       9               90           10.0           90           900
(iv) large towns        2               120          60.0           20           1,200
overall values          55              285          5.18           550          2,850
                        $\sum b$        $\sum c$     $\sum c / \sum b$    $\sum e$     $\sum g$


Moreover estimates can be made concerning the overall body of data. Thus the
estimated overall average number of garages per settlement is obtained
by $\sum c / \sum b$, which in this case is 285/55 = 5.18. The overall population
total, i.e. the total number of garages, can be estimated by
multiplying the sample total by the raising factor (rf). This latter is
the inverse of the sampling fraction (f), such that in this case with
f = 1/10 then rf = 10. This method (i.e. $\sum c$.rf) would give an estimate
of the overall population total of 285 x 10 = 2,850. If, on the
other hand, the actual total number of units is already known, then 
the overall population total can also be obtained by multiplying this 
value of the units by the estimated overall mean. The fact that the 
sampling fraction will not be exactly the same in every stratum means 
that the answer in this case will differ slightly from 2,850, which was 
obtained by the standard method. 
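
The columns of the tabulation for a uniform sampling fraction can be reproduced with a few lines of code. This is a minimal sketch using the sample figures quoted for the four strata; the variable names (`sample`, `raising_factor`, `table`) are illustrative assumptions only.

```python
# Stratum sample data: (units in sample, column b; sample total of garages, column c).
sample = {
    "small villages": (26, 39),
    "large villages": (18, 36),
    "small towns": (9, 90),
    "large towns": (2, 120),
}
raising_factor = 10  # inverse of the uniform sampling fraction f = 1/10

table = {}
for stratum, (b, c) in sample.items():
    mean = c / b                    # column (d) = c/b
    est_units = b * raising_factor  # column (e) = b/f
    est_total = est_units * mean    # column (g) = e.d
    table[stratum] = (mean, est_units, est_total)

sum_b = sum(b for b, _ in sample.values())
sum_c = sum(c for _, c in sample.values())
overall_mean = sum_c / sum_b            # estimated overall mean per settlement
overall_total = sum_c * raising_factor  # estimated overall population total

for stratum, (mean, est_units, est_total) in table.items():
    print(f"{stratum:15s} mean {mean:5.1f}  units {est_units:4d}  garages {est_total:7.0f}")
print(f"overall mean {overall_mean:.2f}, overall total {overall_total}")
```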

These estimated means and totals, whether they be for strata or 
the full body, are based on samples and therefore it is necessary to 
calculate their standard errors. To do this, the standard error must 
be obtained for each stratum separately (this is usually required as 
part of the study anyway) and then the standard error for the overall 
values obtained from the strata values. The calculation of the stratum 
standard error is carried out in the way outlined for random samples, 
bearing in mind the need to make the ‘best estimate’ of the standard 
deviation (p. 79) and to use Student’s t instead of the normal distribution
for assessing the limits of the mean when the sample size
is small (p. 81). Also, in a study such as this, the population is not 
one of infinite size even theoretically, but rather it is a finite popula- 
tion. For this reason the error involved in sampling will probably be 
less than in the case of an infinitely large population. Therefore it 
can be regarded as a ‘small’ population, and the correction for this— 
which was indicated on p. 84—can be applied to the calculations of 
the standard error. The production of a smaller standard error by 
this method is one of the major advantages in stratified, as distinct 
from unstratified, sampling. 

The calculation of the standard error for each stratum therefore 
proceeds as follows. First the best estimate of the variance of the 


data is made by the use of the formula

$$\hat{\sigma}^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

This is then adjusted, because of the finite nature of the population,
by multiplying it by (1 - f). Then, to obtain the standard error this
value is divided by the number of items in the sample and the square
root found. Thus the standard error is calculated by the third form
of the formula set out on p. 84 for use with small (or here, finite)
populations, i.e. the standard error for each stratum is obtained by

$$\text{S.E.}_{\bar{x}} = \sqrt{\frac{\hat{\sigma}^2 (1 - f)}{n}}$$


In the example introduced above, this would yield the following 
values. 


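          Sample    Best estimate    Calculation of standard error of
Strata    mean      of st. dev.      the mean

(i)       1.5       0.5              $\sqrt{\frac{0.5^2}{26} \times 0.9} = \sqrt{0.00865} = 0.09$
(ii)      2.0       0.6              $\sqrt{\frac{0.6^2}{18} \times 0.9} = \sqrt{0.018} = 0.13$
(iii)     10.0      3.0              $\sqrt{\frac{3.0^2}{9} \times 0.9} = \sqrt{0.9} = 0.95$
(iv)      60.0      10.0             $\sqrt{\frac{10.0^2}{2} \times 0.9} = \sqrt{45.0} = 6.7$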


These standard errors must then be applied to the strata sample 
means so that the limits of the strata true means can be assessed with 
given probabilities. In this connection it must be remembered that 
with a small sample the values for the normal distribution must not 
be used but rather Student’s t distribution must be introduced (p.
81). In the present example the third and fourth strata are represented
by only small samples, and therefore this adjustment must be
made in these cases. The limits of the true means for the several 
strata, at a 95% level of probability, are therefore as follows: 


(i)   $\bar{X} = \bar{x}$ +/- 2 S.E. = 1.5 +/- 2(0.09) = 1.5 +/- 0.18
      = 1.32 to 1.68
(ii)  $\bar{X} = \bar{x}$ +/- 2 S.E. = 2.0 +/- 2(0.13) = 2.0 +/- 0.26
      = 1.74 to 2.26
(iii) $\bar{X} = \bar{x}$ +/- t.S.E. = 10.0 +/- 2.3(0.95) = 10.0 +/- 2.19
      = 7.81 to 12.19
(iv)  $\bar{X} = \bar{x}$ +/- t.S.E. = 60.0 +/- 12.71(6.7) = 60.0 +/- 85.16
      = nil to 145.16


The accuracy of the estimates of the true means thus varies markedly
between the strata, for as was indicated earlier (p. 76) the accuracy
of a sample estimate is controlled not by the proportion of the population
that it forms but by the number of items in the sample itself.
This problem will be taken up again later. 
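
The stratum calculations above can be sketched in code as follows. This is an illustrative sketch only: the function name `stratum_se` is hypothetical, the multipliers (2 for the larger samples, and the Student’s t values 2.3 and 12.71 for the small ones) are simply those quoted in the text, and the lower limit is cut off at zero on the assumption that a mean number of garages cannot be negative.

```python
import math


def stratum_se(sd, n, f):
    """S.E. of a stratum sample mean, with the finite-population factor (1 - f)."""
    return math.sqrt(sd ** 2 * (1.0 - f) / n)


# (label, sample mean, best estimate of st. dev., sample size, multiplier)
strata = [
    ("(i)", 1.5, 0.5, 26, 2.0),
    ("(ii)", 2.0, 0.6, 18, 2.0),
    ("(iii)", 10.0, 3.0, 9, 2.3),    # Student's t value quoted in the text
    ("(iv)", 60.0, 10.0, 2, 12.71),  # Student's t value quoted in the text
]
f = 0.1  # uniform sampling fraction

limits = {}
for label, mean, sd, n, k in strata:
    se = stratum_se(sd, n, f)
    low = max(0.0, mean - k * se)  # assumed floor of zero ("nil" in the text)
    high = mean + k * se
    limits[label] = (se, low, high)
    print(f"{label:6s} S.E. {se:5.2f}  limits {low:6.2f} to {high:6.2f}")
```

The text rounds each standard error before multiplying, so the limits printed here may differ slightly in the last decimal place.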

Further problems are the calculation of the standard errors of the 
overall sample mean and of the estimate of the overall population 
total. In both these cases, calculations must first be applied to each 
of the strata to obtain the ‘sample sum of the squares’, i.e. that value
which, when divided by n, gives the variance. This is obtained as
follows. The best estimate of the variance for a finite population is
$\hat{\sigma}^2 (1 - f)$ (p. 84). To obtain the best estimate of the sample sum of
the squares this variance must be multiplied by the number of occurrences,
i.e. by n (see the method of calculating the variance, p. 22),
i.e. it is

$$n \, \hat{\sigma}^2 (1 - f)$$

This is the requisite formula for obtaining the best estimate of the
sample sum of the squares for each stratum. These separate stratum
values must then be summed to give the overall sample sum of the
squares,

i.e. $\sum \hat{\sigma}^2 n (1 - f)$

From this value the standard deviation and standard error of the
mean can be readily calculated. Thus the standard deviation of the
overall sample mean involves dividing the sample sum of the squares
by the number of occurrences in the sample, and finding the square
root (p. 22),

$$\sqrt{\frac{\sum \hat{\sigma}^2 n (1 - f)}{n}}$$

If this is then put above $\sqrt{n}$ the standard error of the overall sample
mean is obtained (p. 75), so that this standard error is written as

$$\frac{\sqrt{\dfrac{\sum \hat{\sigma}^2 n (1 - f)}{n}}}{\sqrt{n}} = \sqrt{\frac{\sum \hat{\sigma}^2 n (1 - f)}{n^2}} = \frac{\sqrt{\sum \hat{\sigma}^2 n (1 - f)}}{n}$$

It is this latter form, i.e. $\dfrac{\sqrt{\sum \hat{\sigma}^2 n (1 - f)}}{n}$, which represents the
most convenient expression for the standard error of the overall
sample mean. Here the n within the summation is the number of items
in each stratum sample, while the dividing n is the number of items in
the overall sample. In practice it is an easy formula to apply, as can be
seen in Table XIV in relation to the present example. It will be seen
that values for $\hat{\sigma}$ have here been assigned to the samples for each
stratum, while the values for n and f are the same as were used in
Table XIII.


Table XIV


Calculation of the standard error of the overall sample mean for
stratified random sampling with uniform sampling fraction

Strata    $\hat{\sigma}$    $\hat{\sigma}^2$    n     f      n(1 - f)             $\hat{\sigma}^2 n(1 - f)$
(i)       0.5     0.25      26    0.1    26 x 0.9 = 23.4      0.25 x 23.4 = 5.85
(ii)      0.6     0.36      18    0.1    18 x 0.9 = 16.2      0.36 x 16.2 = 5.85
(iii)     3.0     9.0       9     0.1    9 x 0.9 = 8.1        9.0 x 8.1 = 72.90
(iv)      10.0    100.0     2     0.1    2 x 0.9 = 1.8        100.0 x 1.8 = 180.00

                                         $\sum \hat{\sigma}^2 n(1 - f)$ = 264.60

Standard error of the overall sample mean
$$= \frac{\sqrt{\sum \hat{\sigma}^2 n (1 - f)}}{n} = \frac{\sqrt{264.60}}{55} = \frac{16.24}{55} = 0.296$$



As the overall sample mean is 5.18 garages per settlement, the true
overall mean (with a probability of 95%) is 5.18 +/- (2 x 0.296)
= 5.18 +/- 0.592 = 4.588 to 5.772, i.e. between 4.6 and 5.8
approximately.
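
The standard error of the overall sample mean can be checked with the following sketch, which assumes the stratum standard deviations and sample sizes quoted above and a uniform sampling fraction of 1/10. The variable names are illustrative. Unrounded arithmetic gives values that agree with the tabulated ones only to within rounding.

```python
import math

# (best estimate of st. dev., number of units in the stratum sample)
strata = [(0.5, 26), (0.6, 18), (3.0, 9), (10.0, 2)]
f = 0.1  # uniform sampling fraction

# Overall sample sum of the squares: sum over strata of sd^2 . n . (1 - f)
sum_of_squares = sum(sd ** 2 * n * (1.0 - f) for sd, n in strata)

# Divide the square root by the overall sample size (55 settlements).
n_total = sum(n for _, n in strata)
se_overall_mean = math.sqrt(sum_of_squares) / n_total

overall_mean = 285 / 55  # sample total of garages / units in sample
low = overall_mean - 2 * se_overall_mean
high = overall_mean + 2 * se_overall_mean

print(f"sum of squares = {sum_of_squares:.2f}")
print(f"S.E. of overall mean = {se_overall_mean:.3f}")
print(f"95% limits: {low:.2f} to {high:.2f}")
```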

To find the standard error of the estimate of the overall population
total (i.e. the estimated total number of garages), this standard error
of the overall sample mean must be modified. In effect, if this value
of 0.296 represents the standard error of the average number of
garages for each (i.e. one) settlement, then it must be multiplied by
the total number of settlements to give the standard error of the total
number of garages. This means that it must be multiplied by n to give
the standard error for the sample total, and then by rf (the raising
factor) to convert this to the standard error of the population total.
The necessary formula is therefore:

$$rf \cdot n \cdot \frac{\sqrt{\sum \hat{\sigma}^2 n (1 - f)}}{n} = rf \sqrt{\sum \hat{\sigma}^2 n (1 - f)}$$

The square-root term has already been
calculated with the standard error of the overall sample mean above,
and in the present example this is 16.24. As the raising factor is 10
(p. 101), the standard error of the overall population total is simply
16.24 x 10 = 162.4. The true population value therefore lies, with
a probability of 95%, within the limits of +/- 325 of the estimated
value of 2,850 (p. 101), i.e. between 2,525 and 3,175. If, as in the
present case, the actual total number of items is known, this standard
error can also be calculated by multiplying the standard error of the
sample mean directly by the number of items, i.e. 0.296 x 536
= 158.656. This can then be applied to the estimate of the population
total made from the true number of items (i.e. an estimate of 2,685),
in which case the true population total will lie between 2,685 +/-
317, which is between 2,368 and 3,002, again with a 95% probability.
The overlap between these two definitions of the limits of the true
overall population total is such that both of the estimates (2,850 and
2,685) are clearly reasonable ones.
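
A brief sketch of the same step in code, again assuming the stratum values used earlier and a raising factor of 10; the names are hypothetical. It applies the raising factor to the square root of the overall sample sum of the squares and sets limits of two standard errors about the estimated total of 2,850.

```python
import math

# (best estimate of st. dev., number of units in the stratum sample)
strata = [(0.5, 26), (0.6, 18), (3.0, 9), (10.0, 2)]
f = 0.1
raising_factor = 10  # inverse of f

root_sum_of_squares = math.sqrt(sum(sd ** 2 * n * (1.0 - f) for sd, n in strata))

# S.E. of the overall population total = rf x sqrt(sum of squares)
se_total = raising_factor * root_sum_of_squares

estimated_total = 285 * raising_factor  # sample total x raising factor
limits = (estimated_total - 2 * se_total, estimated_total + 2 * se_total)

print(f"S.E. of overall total = {se_total:.1f}")
print(f"95% limits: {limits[0]:.0f} to {limits[1]:.0f}")
```

The unrounded standard error is marginally larger than the figure worked by hand, but the limits agree to the nearest whole garage.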

Finally, by a similar method the formula for calculating the standard
error of the stratum sample mean can be converted to the
standard error of the stratum population total. Thus the standard
error of the sample mean $\sqrt{\dfrac{\hat{\sigma}^2 (1 - f)}{n}}$ (p. 102) is multiplied by n
to give the standard error of the stratum sample total and by the
raising factor rf to yield the standard error of the overall stratum
total (p. 104),

$$n \cdot rf \cdot \sqrt{\frac{\hat{\sigma}^2 (1 - f)}{n}} = \frac{n \cdot rf \cdot \hat{\sigma} \sqrt{1 - f}}{\sqrt{n}} = \sqrt{n} \cdot rf \cdot \hat{\sigma} \sqrt{1 - f} = rf \sqrt{\hat{\sigma}^2 n (1 - f)}$$

Thus in the case of stratum (i), small villages (p. 100), the standard
error of the estimated population total of 390 is

$$10 \sqrt{0.5^2 \times 26 \times 0.9} = 10 \sqrt{5.85} = 10 \times 2.42 = 24.2$$

The true population total for that stratum, with a 95% probability,
therefore lies within the limits 390 +/- 48.4 = 341.6 to 438.4. Similar
calculations for the other strata can be made by the reader.
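
For readers who wish to check their own working for the remaining strata, the calculation might be sketched as follows. The function name `stratum_total_se` is a hypothetical label, and only the standard errors are produced, since the appropriate multiplier (2 or a Student’s t value) depends on the size of each sample.

```python
import math


def stratum_total_se(sd, n, f, rf):
    """S.E. of a stratum population total: rf x sqrt(sd^2 . n . (1 - f))."""
    return rf * math.sqrt(sd ** 2 * n * (1.0 - f))


f = 0.1
rf = 10

# (best estimate of st. dev., sample size, estimated stratum total)
strata = {
    "(i)": (0.5, 26, 390),
    "(ii)": (0.6, 18, 360),
    "(iii)": (3.0, 9, 900),
    "(iv)": (10.0, 2, 1200),
}

se_totals = {}
for label, (sd, n, est_total) in strata.items():
    se_totals[label] = stratum_total_se(sd, n, f, rf)
    print(f"stratum {label:6s} estimated total {est_total:5d}  S.E. {se_totals[label]:6.1f}")
```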

As set out here, an analysis of this sort may appear both complex 
and confusing. In practice, however, the calculations involved are 
relatively simple, and reliable values are given for many aspects of
the study. The formulae for these, discussed and illustrated in the
previous pages, are briefly set out in Table XV.


Table XV


Formulae for use with stratified random samples with a uniform
sampling fraction


(i) Standard error of the stratum sample mean
        $\sqrt{\dfrac{\hat{\sigma}^2 (1 - f)}{n}}$    (pp. 101-102)

(ii) Standard error of the overall sample mean
        $\dfrac{\sqrt{\sum \hat{\sigma}^2 n (1 - f)}}{n}$    (pp. 103-104)

(iii) Standard error of the stratum population total
        $rf \sqrt{\hat{\sigma}^2 n (1 - f)}$    (p. 105)

(iv) Standard error of the overall population total
        $rf \sqrt{\sum \hat{\sigma}^2 n (1 - f)}$    (pp. 104-105)


Thus by studying only 55 out of an actual total of 536 settlements 
(or an estimated total of 550) it is possible to assess the average 
number of garages serving small villages, large villages, small towns 
and large towns respectively; the average number of garages per 
settlement if differences in size of settlement are ignored; and the 
total number of garages serving settlements of various sizes and in 
the whole area under study. All these assessments are set out on 
p. 100, while these values are all given within specified ranges of probability
(pp. 102-105). Similar studies of widely varying characteristics
other than garages could also be made from this sample, so that
from a relatively small group of settlements a detailed picture could 
be built up which would apply to the whole range of settlements in 
the area. 

A comparable approach could be applied to the binomial distribu- 
tion illustrated by the land-use example presented on pp. 96-98. The 
hundred sample sites can be classified not only in terms of land-use, 
but also into several strata based on altitude, this yielding the following
values:


Ht.             Arable    Grassland    Woodland    Moorland    Total
<500’             3           4            1           0          8
500’-1,000’       5          21            5          10         41
>1,000’           0           6            0          45         51
All heights       8          31            6          55        100

The requisite standard errors can then be calculated for any of the
land-use categories, both for each stratum (i.e. height range) and 
for the overall sample. The moorland category can be taken as an 
example— 


                Frequency of    Sample
Ht.             moorland        total (n)    p        q        S.E. ($\sqrt{npq}$)

<500’              0               8         0        1.0      0
500’-1,000’       10              41         0.244    0.756    2.76
>1,000’           45              51         0.883    0.117    2.29
All heights       55             100         0.55     0.45     4.97


Thus the standard error of the sample frequency is given for each
stratum and for the overall sample, and this can be readily converted
to a percentage value if it is required. For example, between 500’ and
1,000’ the sample frequency of moorland is 10 out of 41 with a standard
error of 2.76. This could equally be expressed as a sample frequency
of 24.4%, with a standard error of 6.75% (i.e. $\frac{10}{41} \times 100\%$
and $\frac{2.76}{41} \times 100\%$), so that the true frequency of moorland between
500’ and 1,000’ (at the 95% probability level) lies between 10.9% and
37.9% (see Fig. 24).
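
The binomial standard errors for the moorland category can be sketched in code as below. This is an illustration only; the function name `frequency_se` and the dictionary layout are assumptions, and because the proportions are not rounded to three decimal places before use, the results may differ from the tabulated values by a unit or so in the second decimal place.

```python
import math


def frequency_se(count, n):
    """S.E. of a sample frequency, sqrt(n.p.q), with p = count / n and q = 1 - p."""
    p = count / n
    q = 1.0 - p
    return math.sqrt(n * p * q)


# (frequency of moorland, sample total) for each height range
moorland = {
    "<500'": (0, 8),
    "500'-1,000'": (10, 41),
    ">1,000'": (45, 51),
    "All heights": (55, 100),
}

summary = {}
for height, (count, n) in moorland.items():
    se = frequency_se(count, n)
    pct = 100.0 * count / n
    se_pct = 100.0 * se / n  # the same S.E. expressed as a percentage
    summary[height] = (se, pct, se_pct)
    print(f"{height:12s} S.E. {se:4.2f}  = {pct:5.1f}% +/- 2 x {se_pct:4.2f}%")
```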


Variable Sampling Fractions 


It will have been noticed, however, that the degree of accuracy in 
the estimates varies between the strata. In the study of garages it was 
rather low for the towns, especially the larger towns, because of the 
small size of the sample. This can be rectified by ceasing to keep the 
sampling fraction the same for each stratum. Instead it is possible 
to use a Variable Sampling Fraction, thus drawing a different pro- 
portion from each stratum. If possible, it is best to vary the sampling 
fractions in proportion to the standard deviation of the data in the 
stratum concerned. It is not always possible or convenient to calcu- 
late this standard deviation accurately and therefore a rough estimate 
is often made. This may be done simply from the range of values
involved, or from the mean values, in each case assuming that as
these increase so does the standard deviation. Clearly this does not 
provide an accurate answer, but it does give the relative order of 
magnitude in most cases. At times, however, the choice of sampling 
fraction is based on other criteria. For example, it may be known 
that in the larger units conditions vary markedly from one to the 
other, possibly to such an extent that each occurrence is a case in 
itself. With an extreme situation such as this it may even be neces- 
sary to study every member of one particular group. 

Looking at the example of the numbers of garages per settlement, 
the means on p. 100 are in a rough proportion of 1 : 1 : 5 : 30, while
the standard deviation estimates on p. 102 have a ratio of approximately
1 : 1 : 6 : 20. If the latter is accepted as a closer approximation
to the suitable sampling proportions, it may be considered desirable
to take approximately 2% samples of strata (i) and (ii), increasing
the percentage to about 12% for stratum (iii) and to about 40% for
stratum (iv). In practice the actual percentages taken are controlled
by the need to obtain a whole number of items in the sample so that
the sampling fractions for strata (i)-(iv) are 0.024, 0.023, 0.138 and
0.444, while the raising factors are 42.5, 44.0, 7.25 and 2.25. In
Table XVI are set out the data used above, with sample means and 
standard deviations the same as before, but with the tabulation and 
calculation related to a variable sampling fraction within a stratified 
sample. The major difference is that the raising factor must be 
entered into the table, and that strata values must be adjusted by 
this amount before overall values of the mean and total can be esti- 
mated. Moreover these different sampling fractions and raising fac- 
tors must be introduced into the calculation of the various standard 
errors. 
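
The relation between the sample sizes, sampling fractions and raising factors quoted above can be illustrated with a short sketch. It assumes stratum sizes of 255, 176, 87 and 18 settlements and samples of 6, 4, 12 and 8, which are the figures implied by the raising factors given; the variable names are hypothetical.

```python
# (units in the stratum, units drawn in the sample); the stratum sizes are
# assumed here from the raising factors quoted in the text.
strata = {
    "(i)": (255, 6),
    "(ii)": (176, 4),
    "(iii)": (87, 12),
    "(iv)": (18, 8),
}

fractions = {}
for label, (units, sampled) in strata.items():
    f = sampled / units   # sampling fraction
    rf = units / sampled  # raising factor, the inverse of f
    fractions[label] = (f, rf)
    print(f"stratum {label:6s} f = {f:.3f}  rf = {rf:.2f}")
```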

The retention of sample means and standard deviations the same 
as with the uniform sampling fraction is deliberate, so that differences 
in standard errors etc. can be more easily appreciated. In reality, 
these values would differ at least slightly as normally occurs when a 
fresh sample is taken. 

To obtain the overall average it is necessary to multiply the number
of units (column b) and the sample total for each stratum (column c)
by the appropriate raising factor (column e). Each of these is then
summed (i.e. $\sum b.e$ and $\sum c.e$) and then the estimated total ($\sum c.e$)
is divided by the estimated number of units ($\sum b.e$). This gives the
estimated overall population mean, which is here 5.01. Also the estimated
overall population total is obtained in this same calculation,
being 2,685 in whole numbers.


Table XVI


Tabulation for stratified random sampling with variable sampling
fraction


           No. of                                      Estimated
           units        Sample    Sample    Raising    total no.    Estimated    Estimated
Strata     in sample    total     mean      factor     of units     total        mean
(a)        (b)          (c)       (d)       (e)        (b.e)        (c.e)        (c.e/b.e)
(i)        6            9         1.5       42.5       255          383          1.5
(ii)       4            8         2         44.0       176          352          2.0
(iii)      12           120       10        7.25       87           870          10.0
(iv)       8            480       60        2.25       18           1,080        60.0
           30                                          536          2,685        5.01
           $\sum b$                                    $\sum b.e$   $\sum c.e$
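
The tabulation for a variable sampling fraction can be reproduced by the following sketch, which uses the sample sizes, sample totals and raising factors given in the table; the names are illustrative. Note that the estimated total for stratum (i) is 382.5 before rounding, so the unrounded overall total is 2,684.5 rather than 2,685.

```python
# (units in sample b, sample total c, raising factor e) for each stratum
strata = {
    "(i)": (6, 9, 42.5),
    "(ii)": (4, 8, 44.0),
    "(iii)": (12, 120, 7.25),
    "(iv)": (8, 480, 2.25),
}

est_units = {}
est_totals = {}
for label, (b, c, e) in strata.items():
    est_units[label] = b * e   # estimated total no. of units (b.e)
    est_totals[label] = c * e  # estimated total of garages (c.e)

sum_units = sum(est_units.values())
sum_totals = sum(est_totals.values())
overall_mean = sum_totals / sum_units  # estimated overall population mean

print(f"estimated units = {sum_units:.0f}")
print(f"estimated total = {sum_totals:.1f}")
print(f"estimated overall mean = {overall_mean:.2f}")
```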


The standard errors for the individual strata are obtained in the
same way as was done when the sampling fraction was uniform,
although care must be taken to use the requisite sampling fraction
in each case. Thus by applying the formula

$$\sqrt{\frac{\hat{\sigma}^2 (1 - f)}{n}}$$

and with f = 0.024, 0.023, 0.138 and 0.444 in strata (i) to (iv) respectively
(p. 108), the following standard errors are obtained:

stratum (i) 0.20; stratum (ii) 0.30; stratum (iii) 0.80; stratum (iv) 2.64


On comparing these with those given on p. 102 it will be found that 
the present values are higher for strata (i) and (ii), but lower for 
strata (iii) and (iv). As the latter are the ones with the highest mean 
values, and which on other grounds are probably the more important, 
such an improvement is valuable. 

When calculating the standard error for the overall mean the basic
approach is once more the same as in the earlier example, i.e. for each
stratum the sum of the squares is obtained by $\hat{\sigma}^2 n (1 - f)$. As the
sampling fraction varies, however, it is necessary to multiply these
in each case by the square of the respective raising factor before they
are summed. (This factor must here be squared to equate with the
‘sum of the squares’ which it is modifying, whereas in the case presented
on p. 105 it was the standard deviation that was being modified.)
This sum is then divided by the estimated overall total number
of items (N), instead of the sample number of items (n), before the
square root is calculated to yield the standard deviation, i.e.

$$\sqrt{\frac{\sum \hat{\sigma}^2 n (1 - f) (rf)^2}{N}}$$

To obtain the standard error this is then divided by $\sqrt{N}$, and by the
same process of cancellation etc. as on p. 103 the formula for the
standard error of the overall sample mean becomes

$$\frac{\sqrt{\sum \hat{\sigma}^2 n (1 - f) (rf)^2}}{N}$$


In the present example the values in Table XVII are obtained (for 
detailed components, see pp. 102, 108 and Table XVI). 


Table XVII


Calculation of the standard error of the overall sample mean for
stratified random sampling with variable sampling fraction


Strata    $\hat{\sigma}^2 n(1 - f)$    $(rf)^2$    $\hat{\sigma}^2 n(1 - f)(rf)^2$
(i)       1.46                 1,806       2,637
(ii)      1.41                 1,936       2,730
(iii)     93.10                52.56       4,893
(iv)      444.80               5.06        2,251

                    $\sum \hat{\sigma}^2 n(1 - f)(rf)^2$ = 12,511

Standard error of the overall sample mean
$$= \frac{\sqrt{12{,}511}}{536} = \frac{111.9}{536} = 0.21$$
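
The calculation for a variable sampling fraction can be sketched as follows. As an assumption made for convenience, each sampling fraction is taken as the exact inverse of the raising factor rather than as the rounded three-figure value, so the weighted sum differs slightly from the tabulated 12,511 while the resulting standard error still rounds to 0.21. The names are illustrative.

```python
import math

# (best estimate of st. dev., units in sample, raising factor) for each stratum
strata = [
    (0.5, 6, 42.5),
    (0.6, 4, 44.0),
    (3.0, 12, 7.25),
    (10.0, 8, 2.25),
]

weighted_sum = 0.0
est_units = 0.0
for sd, n, rf in strata:
    f = 1.0 / rf  # sampling fraction taken as the inverse of the raising factor
    weighted_sum += sd ** 2 * n * (1.0 - f) * rf ** 2
    est_units += n * rf  # estimated number of units in the stratum

# Divide by N, the estimated overall number of units, not by the sample size.
se_overall_mean = math.sqrt(weighted_sum) / est_units

print(f"weighted sum of squares = {weighted_sum:.0f}")
print(f"S.E. of overall mean = {se_overall_mean:.2f}")
```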


With these standard errors thus calculated it is possible to
establish the limits of the true values, and the following list gives
them with a probability of 95%. As all the samples are small, however,
Student’s t distribution has to be used instead of the normal
distribution.

                                                             Limits of
                     Sample                                  true mean
Body of data         mean      S.E.     t       t x S.E.     (95% probability)
Stratum (i)          1.5       0.20     2.57    0.51         0.99 to 2.01
Stratum (ii)         2.0       0.30     3.18    0.95         1.05 to 2.95
Stratum (iii)        10.0      0.80     2.20    1.76         8.24 to 11.76
Stratum (iv)         60.0      2.64     2.36    6.23         53.77 to 66.23
Total population     5.01      0.21     2.00    0.42         4.59 to 5.43


Equally the standard errors of the estimated totals both of the
strata and the overall populations can be calculated with little further
difficulty. For each stratum the standard error of the estimated total
can be obtained by the same formula as on p. 105, i.e. standard error
of the stratum population total = $rf \sqrt{\hat{\sigma}^2 n (1 - f)}$. In this case,
however, the values for f and rf will be different for each stratum.
Thus for stratum (ii) the standard error of the estimated total of 352
would be

$$rf \sqrt{\hat{\sigma}^2 n (1 - f)} = 44 \sqrt{0.6^2 \times 4 \times 0.977} = 44 \sqrt{1.41} = 44 \times 1.2 = 52.8$$

In the case of the overall population total, the standard error is obtained
as on p. 104, except that the appropriate raising factor must
be applied to each stratum individually, rather than to the sum of the
values. It is therefore squared, as in the case on p. 110. So this standard
error becomes

$$\sqrt{\sum \hat{\sigma}^2 n (1 - f) (rf)^2}$$

The components for this are all included in the earlier calculations
for the standard error of the overall sample mean (p. 110), so that in
the present example it becomes

$$\sqrt{12{,}511} = 111.9$$

In this way the limits of the overall population, at the 95% level of
probability, are 2,685 +/- 224, i.e. from 2,461 to 2,909.
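
A short sketch of the standard errors of the estimated totals under a variable sampling fraction is given below. It again assumes f = 1/rf for each stratum and uses hypothetical names. The stratum (ii) value comes out a little below the hand-worked 52.8 because the square root of 1.41 is not rounded up to 1.2.

```python
import math

# label: (best estimate of st. dev., units in sample, raising factor, sample total)
strata = {
    "(i)": (0.5, 6, 42.5, 9),
    "(ii)": (0.6, 4, 44.0, 8),
    "(iii)": (3.0, 12, 7.25, 120),
    "(iv)": (10.0, 8, 2.25, 480),
}

stratum_se = {}
weighted_sum = 0.0
estimated_total = 0.0
for label, (sd, n, rf, c) in strata.items():
    f = 1.0 / rf
    sum_sq = sd ** 2 * n * (1.0 - f)
    stratum_se[label] = rf * math.sqrt(sum_sq)  # S.E. of the stratum total
    weighted_sum += sum_sq * rf ** 2            # raising factor squared per stratum
    estimated_total += c * rf

se_overall_total = math.sqrt(weighted_sum)
limits = (estimated_total - 2 * se_overall_total,
          estimated_total + 2 * se_overall_total)

print(f"S.E. of stratum (ii) total = {stratum_se['(ii)']:.1f}")
print(f"S.E. of overall total = {se_overall_total:.1f}")
print(f"95% limits: {limits[0]:.0f} to {limits[1]:.0f}")
```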

All limits obtained by the formulae shown in Table XVIII overleaf 
are fairly closely defined. The slightly wider limits for the first two 
strata, as compared to the uniform sampling fraction, are more 
marked proportionately than in terms of actual values, while the 
improvement in the degree of reliability of the estimates in strata (iii) 
and (iv) is most valuable. Furthermore, the overall estimates of both 
mean and total values are also more closely limited in range, this
despite the fact that the total sample consisted of only 30 settlements
compared to 55 when using the uniform sampling fraction. This
increased degree of accuracy with variable sampling fractions is one of 
its most valuable attributes and this method of analysis should be 
used whenever possible. 


Table XVIII


Formulae for use with stratified random samples with a variable
sampling fraction


(i) Standard error of the stratum sample mean
        $\sqrt{\dfrac{\hat{\sigma}^2 (1 - f)}{n}}$    (p. 109)

(ii) Standard error of the overall sample mean
        $\dfrac{\sqrt{\sum \hat{\sigma}^2 n (1 - f) (rf)^2}}{N}$    (p. 110)

(iii) Standard error of the stratum population total
        $rf \sqrt{\hat{\sigma}^2 n (1 - f)}$    (p. 111)

(iv) Standard error of the overall population total
        $\sqrt{\sum \hat{\sigma}^2 n (1 - f) (rf)^2}$    (p. 111)


In fact, it can be extended even further, sub-strata being defined. 
For example, it would be possible for each of the four strata used 
above to be sub-divided in terms of areal characteristics, whether 
these be defined in terms of north-south location, of administrative 
units, or of any other feature. With increased sub-division, however, 
it is essential that there be an adequate sample in each sub-stratum, 
at least if it is desired to calculate the overall error. Innumerable geo- 
graphical problems, which involve large numbers of items, can be 
analysed in this way—farms can be grouped into regions (strata) and 
size (sub-strata) and their characteristics defined by sampling; rivers 
grouped into length and volume of flow; relief forms grouped in 
terms of rock lithology and degree of dissection; rainfall data 
grouped in terms of altitude and location. The possibilities are infi- 
nite, and in all cases a relatively close assessment of the character- 
istics of a large body of data can be obtained by analysing a fairly 
limited sample, provided that the sampling is organized effectively. 


Systematic Sampling 


A stratified sample such as this, with random sampling within each 
stratum and a variable sampling fraction to ensure an adequate
coverage of all strata, is probably the most effective way of sampling
on this scale. At times, however, it may be desired to adopt a sys- 
tematic sampling technique. By this is meant that items are picked at 
some regular interval, e.g. every 10th item on a list; every 20th grid 
square; every 100th line across a map. This is permissible, and pro- 
vided that there is no periodic repetition of conditions at the same 
interval as the sample interval, then in general such a sample can be 
worked as a random sample or as a sample stratified in some pre- 
determined manner. The calculation of sample and population means 
and totals can be effected as in the case of a random stratified sample. 
Moreover, it is also possible to calculate an approximate standard 
error for the strata values by the same method as was used for a 
random sample from within a stratum (Table XV). No fully valid 
estimate is possible of the standard error of the overall mean or the 
overall total, however, although various devices allow of a general 
approximation. For these the reader should turn to one of the more 
advanced texts on the theme of sampling, as also for any further in- 
vestigation into sampling possibilities or techniques as a whole. These 
further studies, however, are virtually all based upon the essential 
foundations of sampling that have been outlined here, and a thorough 
understanding of these is required before the more advanced methods 
are considered. Moreover, for a very large proportion of the prob- 
lems that confront geographers the methods already outlined will 
prove quite adequate. 
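
The selection of a systematic sample from a list can be sketched very simply. The function name `systematic_sample` and the option of a randomly chosen starting point are assumptions made for illustration; the text itself specifies only that items are taken at a regular interval.

```python
import random


def systematic_sample(items, interval, start=None, seed=None):
    """Pick every `interval`-th item from a list, beginning at index `start`.

    If no start is given, one is chosen at random within the first interval
    (an assumed refinement, not something prescribed in the text).
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    if start is None:
        start = random.Random(seed).randrange(interval)
    return items[start::interval]


# Example: every 10th settlement from a numbered list of 100 settlements.
settlements = list(range(1, 101))
chosen = systematic_sample(settlements, 10, start=3)
print(chosen)
```

As the text warns, such a sample is only safe to treat as random when the list has no periodic pattern at the same interval as the sampling interval.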

The particular sampling techniques used, and the detail and com- 
plexity of the answers obtained, must always be ultimately related 
to the problem under study, to the degree of accuracy that is neces- 
sary and to the sort of answer that is required. In all such cases, how- 
ever, the answers in terms of means, standard deviations or totals are 
only sample values. In making estimates of the true values from these 
samples it is necessary to be aware of, and to be able to calculate, the 
standard error that such sampling introduces, so that the true values 
can be estimated within given limits. The values obtained in this way 
may only be required as an indication of the characteristics of that 
set of data, without further studies being based on such character- 
istics. More often, however, it is also desired to compare the char- 
acteristics of different sets of data, so that some judgment can be 
made concerning their similarities or differences. If the character- 
istics thus being compared are themselves based upon sample data, 
from which the true values are estimated, then it is essential that the
sampling error be remembered and considered when such compari- 
sons are being made. It is with such problems of comparison, and 
bearing in mind the various themes which have already been con- 
sidered in this and earlier chapters, that the following three chapters 
are concerned. 
THE COMPARISON OF SAMPLE VALUES – I


Statistical Significance 


So far attention has been mainly concentrated on the primary 
problem of defining, as briefly and concisely as possible, the condi- 
tions presented by a set of data, those data often having been selected 
by specified statistical methods. 

The geographer’s interest in quantitative analysis, however, is not 
limited to such studies of single sets of data, useful and instructive as 
such analyses may well be. Far more frequently it is necessary to com- 
pare one set of values with one or more other sets, as was suggested 
at the end of Chapter 7. The express purpose of such a comparison 
is usually either to group together similar sets so as to delimit 
regions of relatively similar conditions, or to assess the degree of 
difference between sets of data so that valid comparisons can be 
made. Very often such comparisons have been but a slight advance on 
purely subjective assessment, being made merely by a simple inspec- 
tion of sample mean values—the ‘battleship’ diagram purporting to 
show rainfall regime is one of the more common examples of this. 
Yet it is in this very problem of comparing the degree of similarity or 
dissimilarity between different sets of data that standard statistical 
techniques can prove of the greatest assistance to the geographer. 
With but slight extra labour they can readily provide relatively 
objective means of analysis which at least should prevent conclusions
of doubtful validity being drawn and at best should enable virtually 
firm deductions to be made in many cases. Decisions concerning the 
validity of the difference between various sample mean values can 
thus be taken out of the realm of guesswork and brought into that of 
statistical probability. 

The possible methods that can be used for the purpose of com- 
parison are many and varied. Some of these methods can be used 
only in certain cases, while at other times several of the methods may 
be pertinent and permissible, and a choice has to be made between 
them according to ease of computation and the degree of accuracy 
required. Within the following three chapters these various methods 
will be outlined and each will be used to analyse several problems,
the examples being deliberately chosen to illustrate the diversity of 
fields in which the methods may be employed. 

The general problem considered in this chapter is one which 
frequently confronts the geographer, i.e. whether the difference 
between two sample mean values is such that further conclusions 
can validly be based on this difference in value or whether the 
difference is more apparent than real. For example, in a comparative 
study of two coalfields (A) and (B) ten pits were chosen at random 
(see p. 90) from each field and the production of each pit obtained 
over a given period. It is necessary to establish whether or not one of 
these coalfields has a significantly larger production of coal per pit 
than the other. From this consideration many others would then 
develop, such as why this difference exists, or its influence on in- 
dustrial activity or on trade in coal. These later considerations are all 
dependent on a correct assessment of whether or not the two coal- 
fields differ significantly in terms of production per pit and to this end 
the mean production per pit may be calculated for each coalfield. It 
can be assumed that coalfield A had an average production of 0.30
million tons per pit while coalfield B produced an average of 0.34
million tons per pit, i.e. there was a difference of 0.04 million tons
between these two sample average values. Is this difference of 0.04
million tons between these two sets of sample data a statistically
significant difference or is it likely to have been the result of mere
chance related to the particular ten pits in each field for which data
were obtained?

This phrase—statistically significant—will recur frequently in later 
sections and it is essential that the concept be clearly understood. If a 
difference is said to be statistically significant this means that it is 
extremely improbable that such a difference could have occurred by 
chance. This has two main implications. First it implies that if, 
instead of the actual values under consideration, other samples were 
taken of these conditions, or if the full body of data were taken, then 
it is extremely likely that this difference would still be observed— 
always assuming that the sample being considered is representative 
of the full body of data. Second, it implies that if the values recorded 
in the sets of data being compared were all put together and two 
samples picked from this grouped collection at random, i.e. by 
chance, then the difference between these ‘chance’ samples would be 
less than the difference between the actual samples being compared.
The criteria for deciding on statistical significance in this sense, and 
the degree of reliability to be placed on such a decision, are varied 
and will be examined in the succeeding pages. 


Dispersion Diagrams 


A more detailed treatment of the coalfield production figures is 
very revealing in these terms. A full record for the two samples is set 
out below, and this gives an opportunity for a more direct assessment 
of the difference between the two sets of values. A simple visual 
method of comparison is suggested first. 


Annual production by sample pits in two coalfields
(million tons per pit)


Coalfield A        Coalfield B
   0.25               0.27
   0.26               0.28
   0.27               0.29
   0.27               0.33
   0.28               0.34
   0.29               0.35
   0.32               0.35
   0.34               0.38
   0.35               0.39
   0.37               0.42
Average ($\bar{a}$) = 0.30        Average ($\bar{b}$) = 0.34


This method is best illustrated by plotting the two sets of data 
(Fig. 26a) as dispersion diagrams, and entering the median and 
quartile values on each. It can be seen that despite the differences in 
the mean values (the median of A is 0.285 and of B 0.345) there is
nevertheless considerable overlap between the two records. Is this 
overlap so great that there is no significant difference between the 
records, or is it so slight that it can reasonably be ignored? 

Three sets of conditions are regarded as being of diagnostic value 
and these need to be considered first in general terms before they are 
applied to this coalfield example. These three sets of conditions are 
defined simply in terms of the relative positions of the quartile and 
median values, which can easily be established. In the first case 
(Fig. 25a) the lower quartile of one record ($\mathrm{L.Q.}_2$) is greater in
magnitude than is the upper quartile of the other record ($\mathrm{U.Q.}_1$), and
there is thus a clear space on the diagrams between the ranges of the
central 50% of the two sets of data. This relationship can be regarded
as indicating a significant difference between the records under
analysis. At times, however, this degree of difference is not found,
and $\mathrm{L.Q.}_2$ does not exceed $\mathrm{U.Q.}_1$. Here a transitional set of
conditions can be defined, when the lower quartile of one record is
less than the upper quartile of the other but is still greater than the
median, while it is the median of the first record which exceeds the
upper quartile of the second, i.e. $\mathrm{L.Q.}_2 > M_1$ and $M_2 > \mathrm{U.Q.}_1$
(Fig. 25b). If both these conditions are satisfied, then the difference
between the two records is probably significant but not absolutely so.
Finally, if either or both of the above conditions do not hold true
then, no matter what the difference in mean values, it is not safe to
assume that the two records are significantly different. Thus
flexibility is introduced by this method, a transitional category is
defined and a marked degree of difference is required before a fully
significant difference is established.

Figure 25. Criteria for the definition of degrees of statistical
significance from dispersion diagrams (after P. R. Crowe, Scottish
Geographical Magazine, 49 (1933)). Panels: Significant; Probably
significant; Not significant.

Armed with this technique, it is possible to return to the coalfield 
example to assess the degree of significance of the difference between 
the two sets of data. Figure 26a presents the data suitably rearranged. 
The median of coalfield B is greater than the upper quartile of coal- 
field A (0.345 as compared to 0.340) and the lower quartile of B is
greater than the median of A (0.290 compared to 0.285). Thus the
difference between the two coalfields is probably significant but not 
clearly so, especially as this degree of significance is only just achieved 
in terms of both criteria considered. Although it is legitimate to con- 
tinue working on the assumption of a difference in production per 
pit between the two, one would really like more evidence, i.e. a 
larger sample, and conclusions should certainly not be pressed too 
far until such extra evidence has been obtained and analysed. 
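
The three diagnostic conditions can be expressed as a short Python sketch. The function names `quartiles` and `dispersion_verdict` are illustrative assumptions, as is the quartile rule used here (the median of the lower and of the upper half of the ranked values), which is adopted only because it reproduces the quartile values quoted above for the coalfields.

```python
from statistics import median

def quartiles(values):
    """Return (lower quartile, median, upper quartile).

    Assumption: each quartile is the median of the lower or upper half of
    the ranked values (the middle value is left out when the count is odd).
    """
    ranked = sorted(values)
    half = len(ranked) // 2
    return median(ranked[:half]), median(ranked), median(ranked[-half:])

def dispersion_verdict(record_1, record_2):
    """Classify the difference between two records by the quartile criteria."""
    # Treat the record with the higher median as the 'upper' one.
    lower, upper = sorted((record_1, record_2), key=median)
    _, m_low, uq_low = quartiles(lower)
    lq_up, m_up, _ = quartiles(upper)
    if lq_up > uq_low:
        return "significant"
    if lq_up > m_low and m_up > uq_low:
        return "probably significant"
    return "not significant"

# Production per pit (million tons), as tabulated above.
coalfield_a = [0.25, 0.26, 0.27, 0.27, 0.28, 0.29, 0.32, 0.34, 0.35, 0.37]
coalfield_b = [0.27, 0.28, 0.29, 0.33, 0.34, 0.35, 0.35, 0.38, 0.39, 0.42]

print(quartiles(coalfield_a))
print(quartiles(coalfield_b))
print(dispersion_verdict(coalfield_a, coalfield_b))
```

The verdict is only as fine-grained as the three categories allow; the sketch takes no account of the number of items in each record.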

Figure 26. Specific examples of dispersion diagrams used for tests of significance

This sort of simple graphical assessment can be made in any branch
of geographical study, and the following two examples provide
further illustration of its value. An investigation of low-level cliff
remnants around a broad east-west estuary is in progress. On either
side of this estuary ten such remnants are found, rising from a
wave-cut platform. These are not strictly a random sample, and may in
fact be biased in terms of sites favouring preservation. They may be
the only data available, however, and as indicated on p. 98, they
may therefore be analysed as if they were a random sample, though
the results of such an analysis must be used with care. The altitudes
of the bases of these cliffs are accurately measured, and the mean
value of the cliff base is found for each side of the estuary. On
comparison it is found that these mean values differ by 2 ft. if the
average is used (17 ft. O.D. on the southern side and 19 ft. O.D. on the
northern), and by 2.5 ft. if the median is used (17 ft. O.D. and 19.5
ft. O.D. on the southern and northern sides respectively). Is this small
difference in sample mean values merely the result of the limited
number of observations, or is there a really valid difference between
the two sides of the estuary which merits some explanation? A quick
guide to a decision can clearly be made by means of dispersion
diagrams, plotted from the following observed values:


heights on S. side (in ft. O.D.): 15, 19, 18, 17, 17, 19, 14, 16, 19, 16.
heights on N. side (in ft. O.D.): 16, 18, 21, 20, 19, 20, 19, 21, 20, 16.


A visual comparison can then be made (Fig. 26b), and it is found that
lower quartile (N) is greater than median (S), and median (N) is
greater than upper quartile (S). In other words, the slight difference 
in mean height is probably significant, though not conclusively so. 
The analysis suggests two things. First, more observational data are 
required (i.e. a larger sample) in the hope that these will confirm or 
refute this tendency for significance. Second, it is worth considering 
possible causes for such a difference, for it must be clear that an 
analysis such as this can only indicate whether or not a statistically 
significant difference exists, not what may have caused it. 

A slightly different problem is presented by a study of former 
agricultural land-use in an area where lowland clay vales and upland 
limestone plateaux are in close juxtaposition. A stratified random 
sample is made of parishes centred upon, or mainly within, the low- 
lands and of those that are mainly upland parishes. In each stratum 
the sample consists of 10 items (i.e. parishes). Records indicate that 
for some given date in the past the percentage of land under meadow 
in each case was as follows: 


lowland parishes (% in meadow): 10, 20, 25, 25, 30, 35, 45, 50, 50, 60.
upland parishes (% in meadow): 25, 30, 30, 40, 40, 50, 50, 60, 60, 65.


A simple calculation shows that the averages for lowlands and uplands
differ as between 35% and 45% meadowland, while in terms of
median the difference is between 32.5% and 45%. Again the problem
is similar; is this difference in sample mean values between two con- 
trasting groups of parishes sufficient to justify an emphasis on con- 
trasting proportions of meadowland as between lowland and upland 
(or as between clayland and limestone), or is the range of values 
within each group such that a generalization of that sort is unsound 
and unjustified? The dispersion diagrams in Fig. 26c illustrate the 
considerable range of overlap between the two sets of data; neither 
of the criteria for even a ‘probably significant’ verdict is present. So
these data would not justify a claim of a significant contrast between 
lowland and upland conditions. If a larger sample were studied data 
justifying such a contrast might well be obtained, but meantime any 
conclusions claiming a causal relationship between these parish 
groups and proportions of meadowland would be unsound. 

This graphical method of assessing significance is thus simple to 
apply to a wide range of problems, but it has several limitations. It 
has only three sets of distinctive conditions without any gradation
between them. It, moreover, makes no real allowance for the number 
of items entering into the computation, which is especially im- 
portant in the case of such examples as these where the samples are 
small ones, and the degree of stringency involved in the test for 
significance could well be intensified. Finally, of course, it is based on 
the median and quartiles as measurements of mean and deviation 
values, while it is the arithmetic average and the standard deviation 
which, as indicated in Chapters 2 and 3, possess the greatest merit in 
these fields. 


Standard Error of the Difference 


There are two methods which, in large measure, eliminate these 
disadvantages and although they each involve more computation 
than do the graphical methods they are, on balance, infinitely pre- 
ferable. These two methods will be applied first to the coalfield 
example, and then to the other problems briefly considered above, 
and the difference between the methods will thus become apparent. 
In Chapter 6 the relationship between the mean value of a sample
and the true mean value was considered, this relationship (known as
the Standard Error of the Mean) being expressed by

$$\frac{\hat{\sigma}}{\sqrt{n}} \quad\text{or}\quad \sqrt{\frac{\hat{\sigma}^2}{n}},$$

i.e. the best estimate of the standard deviation divided by the square
root of the number of items in the sample. The examples considered 
in the present chapter are also based on samples: the problem is 
whether or not the differences between these sample means are suf- 
ficiently great to justify a conclusion that the true means also differ 
significantly. 

Thus a comparison is being made between two sample means, each 
of which has a standard error (S.E.). From the data for the coalfield 
example set out on p. 117 it can be calculated that coalfield A has a 
sample mean $\bar{a}$ of 0.30 million tons and a standard error $\mathrm{S.E.}_a$ of
0.013 million tons, while for coalfield B the sample mean $\bar{b}$ is 0.34
million tons and standard error $\mathrm{S.E.}_b$ is 0.016 million tons.

Both the methods now to be considered utilize these facts, in that 
both are concerned with assessing, from these data, the standard 
error of the difference between these two sample means, i.e. the 
standard error of $|\bar{a} - \bar{b}|$. This standard error partakes of the
probability characteristics of the normal frequency curve, as did the
standard error of the mean (p. 75), so that the probability that the
actual difference will be more than twice this standard error is about
5%, and that it will be more than three times this standard error is
about 0.27%. In other words, if the actual difference between $\bar{a}$ and $\bar{b}$
(in this case 0.04 million tons) is greater than twice the standard
error of the difference, then it is unlikely (though not completely
certain) that a difference of this size between the two sample means
occurred by chance, i.e. the difference is ‘probably significant’ and it
is likely that it would also occur between the true means. If it is
greater than three times the standard error of the difference, however,
then the difference is almost certainly significant (99.7% certain).
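
The two probabilities quoted for the normal frequency curve can be checked with Python's standard library. The function name `two_sided_tail` is an illustrative assumption; the calculation uses the complementary error function, which gives the chance of a normal deviate lying more than k standard errors on either side of zero. The figure for two standard errors comes out slightly under the round 5% used in the text.

```python
import math

def two_sided_tail(k):
    """Probability that a normal deviate lies more than k standard errors from zero."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3):
    # Expressed as a percentage, to be set beside the 5% and 0.27% of the text.
    print(k, round(100 * two_sided_tail(k), 2))
```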


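Coalfield A: $\bar{a} = 0.30$; $\hat{\sigma}_a = 0.042$; $\mathrm{S.E.}_a = \hat{\sigma}_a / \sqrt{n_a} = 0.042 / \sqrt{10} = 0.013$
Coalfield B: $\bar{b} = 0.34$; $\hat{\sigma}_b = 0.05$; $\mathrm{S.E.}_b = \hat{\sigma}_b / \sqrt{n_b} = 0.05 / \sqrt{10} = 0.016$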
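
These values can be reproduced from the tabulated pit outputs. The sketch below is illustrative only; the helper name `standard_error` is an assumption. Python's `statistics.stdev` divides by n - 1, which corresponds to the ‘best estimate’ of the standard deviation used in the text.

```python
from math import sqrt
from statistics import mean, stdev

# Production per pit (million tons) for the ten sample pits in each coalfield.
coalfield_a = [0.25, 0.26, 0.27, 0.27, 0.28, 0.29, 0.32, 0.34, 0.35, 0.37]
coalfield_b = [0.27, 0.28, 0.29, 0.33, 0.34, 0.35, 0.35, 0.38, 0.39, 0.42]

def standard_error(values):
    """Best estimate of the S.D. (divisor n - 1) divided by the square root of n."""
    return stdev(values) / sqrt(len(values))

sd_a, sd_b = stdev(coalfield_a), stdev(coalfield_b)
se_a, se_b = standard_error(coalfield_a), standard_error(coalfield_b)

print("A", round(mean(coalfield_a), 2), round(sd_a, 3), round(se_a, 3))
print("B", round(mean(coalfield_b), 2), round(sd_b, 3), round(se_b, 3))
```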


An assessment of the Standard Error of the Difference between 
sample means can thus provide a valuable test of significance. It is 
based on average and standard deviation values, it allows for the 
number of items in each sample, and it imposes sufficiently stringent 
conditions, i.e. odds of at least 19:1 before ‘probably significant’ 
can be applied, for findings to be accepted with considerable
confidence. But how can this standard error of the difference be
calculated? The method depends on the fact that the standard error of
the mean is a function of the standard deviation, which is itself the
square root of the variance (p. 22). So, if the standard error of the
sample mean is $\hat{\sigma}/\sqrt{n}$, then the variance of the sample mean is the
standard error squared, i.e. $(\hat{\sigma}/\sqrt{n})^2$, or more simply $\hat{\sigma}^2/n$. Furthermore,
it can be accepted that the variance of the sum of, or the difference
between, two sample means is the sum of the separate variances of 
the two sample means. To put it another way, in adding together 
two sample means, or in subtracting one from another, each of the 
values in the calculation has its own standard error and therefore 
the answer is itself subject to error from both sources, i.e. the sample
answer, be it sum or difference, is liable to differ more from the true 
answer than do either of the sample means from their respective true 
means. 

Taking now the variance of the difference between two sample 
means, and applying the rule above: 
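the variance of the difference between $\bar{a}$ and $\bar{b}$
= the variance of $\bar{a}$ plus the variance of $\bar{b}$, i.e.

$$\mathrm{var}(\bar{a} - \bar{b}) = \frac{\hat{\sigma}_a^2}{n_a} + \frac{\hat{\sigma}_b^2}{n_b} \quad\text{or}\quad \left(\frac{\hat{\sigma}_a}{\sqrt{n_a}}\right)^2 + \left(\frac{\hat{\sigma}_b}{\sqrt{n_b}}\right)^2$$

Moreover, as was indicated above, the variance of $(\bar{a} - \bar{b})$ is the
standard error of $(\bar{a} - \bar{b})$ squared; conversely, the standard error of
$(\bar{a} - \bar{b})$ is the square root of the variance of $(\bar{a} - \bar{b})$, i.e.

$$\mathrm{S.E.}(\bar{a} - \bar{b}) = \sqrt{\frac{\hat{\sigma}_a^2}{n_a} + \frac{\hat{\sigma}_b^2}{n_b}} \quad\text{or}\quad \sqrt{\left(\frac{\hat{\sigma}_a}{\sqrt{n_a}}\right)^2 + \left(\frac{\hat{\sigma}_b}{\sqrt{n_b}}\right)^2}$$

This formula, in either of its forms, can then be applied to the
coalfield data, calculating as follows:

$$\mathrm{S.E.}(0.04\ \text{mill. tons diff.}) = \sqrt{\frac{0.042^2}{10} + \frac{0.05^2}{10}} = \sqrt{\frac{0.00176}{10} + \frac{0.0025}{10}} = \sqrt{\frac{0.00426}{10}} = \sqrt{0.0004} = 0.02$$

or, in the second form,

$$\mathrm{S.E.}(0.04\ \text{mill. tons diff.}) = \sqrt{\left(\frac{0.042}{\sqrt{10}}\right)^2 + \left(\frac{0.05}{\sqrt{10}}\right)^2} = \sqrt{0.0133^2 + 0.0158^2} = \sqrt{0.000176 + 0.00025} = \sqrt{0.0004} = 0.02$$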
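
The same calculation can be carried through without intermediate rounding. This is a sketch only, and the function name `se_of_difference` is an assumption. Carried to more decimal places, the standard error of the difference comes out a little above the rounded 0.02, so that the observed difference is very slightly less than twice the standard error.

```python
from math import sqrt
from statistics import mean, variance

coalfield_a = [0.25, 0.26, 0.27, 0.27, 0.28, 0.29, 0.32, 0.34, 0.35, 0.37]
coalfield_b = [0.27, 0.28, 0.29, 0.33, 0.34, 0.35, 0.35, 0.38, 0.39, 0.42]

def se_of_difference(sample_1, sample_2):
    """Square root of the summed variances of the two sample means.

    statistics.variance divides by n - 1, i.e. it is the square of the
    'best estimate' of the standard deviation.
    """
    return sqrt(variance(sample_1) / len(sample_1)
                + variance(sample_2) / len(sample_2))

se_diff = se_of_difference(coalfield_a, coalfield_b)
difference = abs(mean(coalfield_a) - mean(coalfield_b))
ratio = difference / se_diff   # how many standard errors the difference spans

print(round(se_diff, 4), round(difference, 2), round(ratio, 2))
```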


Returning to the earlier argument that there is only a 5% probability 
that the actual difference will be as great as twice the standard error 
of the difference, these two values can now be compared: 

actual difference = 0.04 mill. tons

2 S.E. of difference = 0.04 mill. tons

This means that if all the twenty values for the two coalfields were 
taken together and grouped into two sets of ten purely at random, 
a difference as great as the one observed, i.e. 0.04 million tons, would
occur on no more than 5% of the occasions it was done. So it would 
occur by chance one time in 20, and this 5% (or 2 S.E.) level of prob- 
ability is taken as the highest chance probability value which can be 
allowed if the difference is to be described as probably significant. If 
the difference exceeds two and a half or three times the standard 
error then it is truly significant, but if it is less than twice the standard 
error then the difference is possibly not significant. It must be stressed 
that a value of this latter magnitude does not prove that the difference 
is not significant, but rather it indicates that a significant difference 
has not been adequately proved, and that judgment must be deferred. 
What must be borne in mind is that if a difference of this order, i.e. 
less than 2 standard errors, is obtained, any further deductions based 
on an assumed difference between the two sets of data may be un- 
sound and are at best unproven. 
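
The idea of taking all twenty values together and grouping them into two sets of ten at random can be tried out directly, since the number of possible groupings is small enough to enumerate in full. The sketch below is an illustration under stated assumptions rather than a method given in the text: the outputs are expressed in hundredths of a million tons so that the arithmetic is exact, and every possible first group of ten is examined. The proportion it prints can be set beside the 5% level discussed above; being exact for these particular twenty values, it need not agree precisely with the figure derived from the normal curve.

```python
from itertools import combinations

# Pit outputs in hundredths of a million tons, so the arithmetic is exact.
coalfield_a = [25, 26, 27, 27, 28, 29, 32, 34, 35, 37]
coalfield_b = [27, 28, 29, 33, 34, 35, 35, 38, 39, 42]

pooled = coalfield_a + coalfield_b
total = sum(pooled)
# Difference between the two sample totals: 40, i.e. 0.04 million tons per pit.
observed = abs(sum(coalfield_b) - sum(coalfield_a))

splits = 0
as_extreme = 0
for first_group in combinations(pooled, 10):
    splits += 1
    first_sum = sum(first_group)
    # The second group is whatever is left, so its total is total - first_sum.
    if abs(total - 2 * first_sum) >= observed:
        as_extreme += 1

proportion = as_extreme / splits
print(splits, as_extreme, round(100 * proportion, 1))
```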

In the coalfield example this test by the standard error of the 
difference gives the same answer as did the dispersion diagrams, i.e.
probably significant. On the other hand, this present method gives a
numerical expression of the degree of significance, and from this it
can be appreciated that the difference only just reaches the critical
value. Against this, obvious critical limits for varying degrees
of probability are few, being mainly whole-number multiples of the 
standard error. Intervening values can be computed but this is a 
needlessly laborious process. 


Student’s t Test 


The best and simplest way to eliminate this difficulty is to apply 
a more refined technique, though still embodying the standard error 
of the difference. This technique is known as Student’s t Test and 
employs the Student’s t distribution introduced in Chapter 6 (p. 81).
It provides an index, t, to represent the relationship between the
difference between the means and the standard error of this difference.
This index can then be referred to prepared tables or a graph, from
which the degree of significance of the difference can be assessed.
Student’s t, the index concerned, is readily calculated as follows:

$$t = \frac{\text{difference between the means}}{\text{standard error of the difference}} = \frac{|\bar{a} - \bar{b}|}{\sqrt{\dfrac{\hat{\sigma}_a^2}{n_a} + \dfrac{\hat{\sigma}_b^2}{n_b}}}$$


So, instead of simply looking to see whether the observed difference 
is greater than two standard errors, a calculation is made to express 
exactly how many times greater than the standard error the observed 
difference really is. In the coalfield example, therefore, it is seen that 


$$t = \frac{0.34 - 0.30}{\sqrt{\dfrac{0.00176}{10} + \dfrac{0.0025}{10}}} = \frac{0.04}{\sqrt{0.0004}} = \frac{0.04}{0.02} = 2.0$$


It is this value of t = 2.0 which is checked on tables or graph to find
the percentage probability of it occurring by chance. It is here, how- 
ever, that a more fundamental difference enters into this method as 
compared with the previous one. On the graph of Student’s t (Fig.
27) the co-ordinates are (i) the value of t and (ii) a value representing
the number of occurrences on which the comparison is based. This is
because in general terms the significance of a given value of t is less
the smaller the number of occurrences involved. That is to say, the
smaller the samples, the greater the difference between the means
must be (i.e. a higher value for t) in order to reach a given level of
significance. Once the number of occurrences reaches 35-40 little
further change occurs in the required value for t as the number of
occurrences increases. 

One further point needs stressing, however. As indicated on p. 81 
Student’s t graph (or table) does not use the exact number of
occurrences but instead the number known as the ‘degrees of freedom’.
By this is meant that number of values that can be assigned
arbitrarily, assuming that the sample mean remains as it is. So, in the
case of coalfield A once nine of the values have been established then
the tenth value must follow automatically to yield a sample mean of
0.30 million tons. The same is also true for coalfield B, so for each
set of data the degrees of freedom are one less than the number of
occurrences, i.e. $(n - 1)$. For the full comparison here, therefore,
both values of degrees of freedom must be incorporated, and this can
be written as:

$$\text{degrees of freedom (d.f.)} = (n_a - 1) + (n_b - 1) = n_a + n_b - 2$$

In this example degrees of freedom are therefore $10 + 10 - 2 = 18$,
this indicating that the value of t = 2.0 is based on 18 freely
varying occurrences (the remaining two occurrences being controlled
by the requirements of mean values and the other 18 occurrences).
By reading t = 2.0 against d.f. = 18 in Fig. 27 it can be seen that,
because of both the scatter of values within each sample and the 
actual size of the two samples, the difference of 0.04 million tons
between the two means does not quite reach the 5% level of prob- 
ability, i.e. it does not quite qualify as probably significant. Thus, by 
considering the size of the sample (by means of degrees of freedom) 
the test becomes more stringent, and a difference which seemed 
probably significant by other looser tests becomes of marginal 
validity when Student’s t test is applied. Again it must be stressed
that this does not mean that there was no difference between the two 
coalfields as regards output per pit. It does mean, however, that as 
the values are derived from samples the figures in themselves are not 
conclusive and the verdict must remain as ‘not proven’. 
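
The whole test for the coalfield example can be gathered into one short sketch. The function name `students_t` and the small dictionary of critical values are illustrative assumptions; the critical values are the usual two-tailed 5% points from published tables of Student's t and stand in here for a reading of Fig. 27. The last lines illustrate the meaning of degrees of freedom: with the mean fixed, nine values determine the tenth.

```python
from math import sqrt
from statistics import mean, variance

coalfield_a = [0.25, 0.26, 0.27, 0.27, 0.28, 0.29, 0.32, 0.34, 0.35, 0.37]
coalfield_b = [0.27, 0.28, 0.29, 0.33, 0.34, 0.35, 0.35, 0.38, 0.39, 0.42]

# Two-tailed 5% points of Student's t from standard published tables,
# keyed by degrees of freedom (only the entries needed in this chapter).
T_CRITICAL_5_PERCENT = {18: 2.101, 28: 2.048}

def students_t(sample_1, sample_2):
    """Return (t, degrees of freedom) using the standard error of the difference."""
    se_diff = sqrt(variance(sample_1) / len(sample_1)
                   + variance(sample_2) / len(sample_2))
    t = abs(mean(sample_1) - mean(sample_2)) / se_diff
    return t, len(sample_1) + len(sample_2) - 2

t, df = students_t(coalfield_a, coalfield_b)
reaches_5_percent = t > T_CRITICAL_5_PERCENT[df]
print(round(t, 2), df, reaches_5_percent)

# Degrees of freedom: with the mean fixed at 0.30, nine values fix the tenth.
tenth = 10 * 0.30 - sum(coalfield_a[:9])
print(round(tenth, 2))
```

The comparison with the tabulated value gives the same verdict as the graph: the difference falls just short of the 5% level.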


Other Specific Examples 


The calculations involved in these methods usually follow on the 
prior computation of average and standard deviation values for 
other purposes. Furthermore, there is nothing abstruse or difficult in 
them, for squares and square roots represent the most advanced 
techniques required. However, the reader may wish to see these 
methods employed in one or two examples, and it is therefore 
proposed to rework the geomorphological and agricultural examples 
previously analysed by graphical methods. 

For the comparison of cliff-foot heights on either side of an 
estuary the data presented on p. 119 can be summarized as follows: 


southern side: average ($\bar{x}_1$) = 17; best estimate of S.D. ($\hat{\sigma}_1$)
= 1.77; no. of items ($n_1$) = 10

northern side: average ($\bar{x}_2$) = 19; best estimate of S.D. ($\hat{\sigma}_2$)
= 1.825; no. of items ($n_2$) = 10

The problem is to assess the significance of this difference of 2 ft.
between the sample means. As with the coalfields, the comparison of
dispersion diagrams in this problem gave a ‘probably significant’
answer, but as the coalfield example showed, this need not necessarily
be the answer given by these other methods. To calculate the standard
error of the difference, so as to use it as a test, the following formula
is again used:

$$\mathrm{S.E.}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}} = \sqrt{\frac{1.77^2}{10} + \frac{1.825^2}{10}} = \sqrt{\frac{3.13}{10} + \frac{3.33}{10}} = \sqrt{0.646} = 0.804$$

Twice the standard error thus gives a value of 1.608 and three times
the standard error one of 2.412. The actual difference being 2.0, i.e.
between two and three standard errors, it is clearly a probably
significant one.

Figure 27. Student’s t Graph (based on data in D. V. Lindley and J. C. P. Miller,
Cambridge Elementary Statistical Tables, Table 3)

If this standard error of the difference is employed in Student’s
t Test the same general answer is obtained:

$$t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n_1} + \dfrac{\hat{\sigma}_2^2}{n_2}}} = \frac{2.0}{0.804} = 2.49$$

The degrees of freedom ($n_1 + n_2 - 2$) are 18, and by reference to the
graph (Fig. 27) it can be seen that the observed difference is significant 
at about the 3% level. So in this case the application of these mathe- 
matically sounder and more stringent tests not only confirms that the 
difference observed is probably significant, but also indicates that 
the difference is much nearer being truly significant than could 
possibly be assessed by means of the dispersion diagram. 

As for the comparison of proportions of meadowland between 
lowland and upland parishes, even the dispersion diagrams suggested 
a lack of significance. Some indication of the degree to which the 
difference fails to be significant could well be of value, however, and 
it is therefore worth while analysing by these further methods. The 
data may be summarized as follows: 
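lowland parishes: $\bar{x}_1 = 35$; $\hat{\sigma}_1 = 15.8$; $n_1 = 10$
upland parishes: $\bar{x}_2 = 45$; $\hat{\sigma}_2 = 14.15$; $n_2 = 10$
From these data the standard error of the difference may be readily
calculated:

$$\mathrm{S.E.}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}} = \sqrt{\frac{15.8^2}{10} + \frac{14.15^2}{10}} = \sqrt{\frac{250}{10} + \frac{200}{10}} = \sqrt{25.0 + 20.0} = \sqrt{45.0} = 6.7$$

As twice the standard error is thus 13.4, and the actual difference is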
only 10, it is clear that this method too indicates a lack of significant 
difference. Applying these calculations to Student’s t Test, the actual
probability of this difference occurring by chance can be assessed:

$$t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n_1} + \dfrac{\hat{\sigma}_2^2}{n_2}}} = \frac{10}{6.7} = 1.49$$

With again 18 degrees of freedom, it can be seen that this value of t
fails even to reach the line of 10% probability, let alone the necessary
5% level.
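
Both reworked examples start from summary values rather than from the raw observations, and the arithmetic can be checked in the same way. The function name `t_from_summary` is an illustrative assumption; the figures passed to it are the averages, best estimates of the standard deviation and sample sizes quoted above.

```python
from math import sqrt

def t_from_summary(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Return (S.E. of the difference, t, degrees of freedom) from summary values."""
    se_diff = sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return se_diff, abs(mean_1 - mean_2) / se_diff, n_1 + n_2 - 2

# Summary values quoted in the text.
cliffs = t_from_summary(17, 1.77, 10, 19, 1.825, 10)   # cliff-foot heights (ft. O.D.)
meadow = t_from_summary(35, 15.8, 10, 45, 14.15, 10)   # % of land under meadow

for name, (se_diff, t, df) in (("cliff bases", cliffs), ("meadowland", meadow)):
    print(name, round(se_diff, 3), round(t, 2), df)
```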

By now the relative advantages and disadvantages of these dif- 
fering methods must be fairly obvious. For rapid comparison, 
especially if the data involve numbers which prove awkward for easy 
calculation, the visual assessment by dispersion diagrams is useful. 
For greater precision, however, the other methods are needed. If the 
number of occurrences is large, i.e. with large samples, the use of the 
Standard Error of the Difference is as sound as any other, and is 
usually preferable in such cases. With moderate to small samples, 
Student’s t Test should be applied, however, while in all cases it is
desirable to use the ‘best estimate’ of the standard deviation, as it is 
sample values which are being compared. 

In all the examples which have so far been analysed, however, the 
two sample means have always each been based on the same size of 
sample. This is not invariably so with data that the geographer may 
need to examine. It might be necessary, for example, to assess the 
relative importance of two different routeways as outlets for a rather 
inaccessible mining area. Sample traffic surveys are taken of the 
number of mining lorries using each of these two routes, and the 
mean values obtained are 150 lorries per week for Route A and 200 
lorries per week for Route B. On such data, traffic flow diagrams 
have often been prepared, and a difference as great as this would 
probably be accepted at face value. Two important pieces of informa- 
tion are not provided in the above figures, however. One is an index 
of scatter, i.e. deviation, and the other is the number of weeks during 
which counts were made, i.e. the size of the sample. With these added 
the relevant data are as follows: 

Route A: $\bar{x}_1 = 150$; $\hat{\sigma}_1 = 59.9$; $n_1 = 20$
Route B: $\bar{x}_2 = 200$; $\hat{\sigma}_2 = 66.5$; $n_2 = 10$
These values would obtain if the observed values were: 


Route A: 40, 60, 80, 90, 100, 110, 120, 130, 140, 150, 150, 160, 170, 
180, 190, 200, 210, 220, 240, 260. 

Route B: 80, 110, 160, 180, 200, 200, 220, 240, 290, 320. 

It can now be seen that on both routes numbers of lorries per week 
were highly variable. Moreover, a greater number of counts was 
made on one route than on the other, i.e. the sizes of the two samples 
differ. Such limitations and qualifications as these are often operative
in this kind of study, and it is therefore necessary to apply fairly
stringent tests before accepting the apparently marked difference
between the routes as applying over the long term. Using Student’s t
Test and the ‘best estimate’ of the standard deviations, the following
is found:


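$$t = \frac{\bar{x}_B - \bar{x}_A}{\sqrt{\dfrac{\hat{\sigma}_A^2}{n_A} + \dfrac{\hat{\sigma}_B^2}{n_B}}} = \frac{200 - 150}{\sqrt{\dfrac{3588}{20} + \dfrac{4422}{10}}} = \frac{50}{\sqrt{179.4 + 442.2}} = \frac{50}{\sqrt{621.6}} = \frac{50}{24.93} = 2.0 \text{ (approx.)}$$

$$\text{d.f.} = n_A + n_B - 2 = 20 + 10 - 2 = 28$$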


On reading t = 2.0 against 28 degrees of freedom in Fig. 27 it is
found that this difference of 50 lorries per week, apparently so clear- 
cut, does not reach the 5% level of probability that such a difference 
could occur by chance. Once again it must be stressed that this 
simply means that a significant difference has not been proven, not 
that it does not exist. It also indicates that more data are required, 
and with an increase in the number of traffic-flow surveys, especially 
on Route B, this question of degree of significance should be clarified 
one way or another. As was suggested on p. 107, the relative sizes of
the samples should be proportional to the standard deviations. 
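
The arithmetic of this test can be checked with a short Python sketch. It is only an illustration under stated assumptions: it works from the summary figures quoted above (means, ‘best estimate’ standard deviations and sample sizes), the function name `t_unequal_samples` is invented for the purpose, and the 5% threshold for 28 degrees of freedom is taken from standard t tables rather than read from Fig. 27.

```python
import math

def t_unequal_samples(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, degrees of freedom) for two samples of unequal size.

    sd_a and sd_b are assumed to be 'best estimate' standard deviations.
    """
    # Standard error of the difference between the two sample means
    se_diff = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    t = abs(mean_b - mean_a) / se_diff
    df = n_a + n_b - 2
    return t, df

# Route A and Route B summary values as quoted in the text
t, df = t_unequal_samples(150, 59.9, 20, 200, 66.5, 10)

# Assumption: two-tailed 5% point of t for 28 d.f., from standard tables
T_CRIT_5PCT_28DF = 2.048

print(f"t = {t:.2f} with {df} degrees of freedom")
if t >= T_CRIT_5PCT_28DF:
    print("reaches the 5% level")
else:
    print("does not reach the 5% level")
```

Changing `n_b` while holding the other figures fixed gives a quick impression of how far additional counts on Route B alone would move the result.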


Comparison of Coefficients of Variation 


Finally, before moving to other methods of comparing differing 
sets of data in the succeeding chapters, there is yet a further modifica- 
tion of one of these methods which is of value in many geographical 
studies. Especially in climatology, though at times in other branches 
of the subject too, maps are prepared based on the Coefficient of 
Variation (V)—p. 40—and then comparisons are made between 
places with differing values of this coefficient. Far too seldom, how- 
ever, has an investigation first been made to see whether the difference 
being explained or used is really statistically significant, or whether 
it is likely to be simply a chance occurrence. 

Such an assessment can be made by a modification of the formula
for the Standard Error of the Difference between sample means. The
latter relies on the standard error of the mean, $\sigma/\sqrt{n}$. As has been
indicated earlier (p. 78) there is also a standard error of the standard
deviation, $\sigma/\sqrt{2n}$, and from this is derived the standard error of the
coefficient of variation (V), i.e. $V/\sqrt{2n}$. This latter value can readily be
substituted for $\sigma/\sqrt{n}$ in the formula for the standard error of the
difference between means, to give a method of calculating the
standard error of the difference between coefficients of variation. So
instead of

$$\text{S.E.}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

can be written

$$\text{S.E.}(V_1 - V_2) = \sqrt{\frac{V_1^2}{2n_1} + \frac{V_2^2}{2n_2}}$$


A comparison may then be made between, for example, two rainfall
stations, each with a 30-year record, but in one case with V = 13%
and in the other case V = 10%. Thus:

$$\text{S.E.}(13 - 10) = \sqrt{\frac{13^2}{2 \times 30} + \frac{10^2}{2 \times 30}} = \sqrt{\frac{169}{60} + \frac{100}{60}} = \sqrt{\frac{269}{60}} = \sqrt{4.5} = 2.1$$

Twice the standard error is 4.2 and the actual difference only 3.0, so
that it is far from being a significant difference. Even if the value
for $V_1$ were increased to 14%, and the difference between $V_1$ and $V_2$
thus raised to 4%, this difference would still not be even probably
significant.
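
The same comparison can be sketched in Python. The function name `se_diff_cv` is an illustrative choice, and the rule of comparing the difference with twice the standard error is the one used in the text; nothing else is assumed.

```python
import math

def se_diff_cv(v1, n1, v2, n2):
    """Standard error of the difference between two coefficients of
    variation (both given in per cent)."""
    return math.sqrt(v1 ** 2 / (2 * n1) + v2 ** 2 / (2 * n2))

# Two 30-year rainfall records: first V = 13% against 10%, then 14% against 10%
for v1, v2 in [(13, 10), (14, 10)]:
    se = se_diff_cv(v1, 30, v2, 30)
    diff = abs(v1 - v2)
    verdict = "exceeds" if diff > 2 * se else "is less than"
    print(f"V1 = {v1}%, V2 = {v2}%: S.E. = {se:.2f}; "
          f"difference {diff} {verdict} twice the S.E. ({2 * se:.2f})")
```

In both cases the difference stays below twice the standard error, which is the point made above.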

It is thus both salutary and profitable to ensure that the differences 
between the sample mean values, or possibly between the sample 
variability values, under consideration possess a certain element of 
statistical significance. At times this may mean that judgment must 
be deferred, or even that the absence of a statistically significant 
difference must be accepted. At other times a difference of sufficient 
significance is established for valid conclusions and further deduc-
tions to be based firmly and soundly upon it. It is in this role of
focusing attention upon the legitimate cases, and indicating the
degree of reliability or unreliability of marginal cases, that the 
methods outlined in this chapter can provide the greatest assistance 
to the geographer. Other methods, for dealing with more complex 
data, or with data in different forms, also exist, and some of these 
will be considered in the following two chapters.


CHAPTER 9
THE COMPARISON OF SAMPLE VALUES—II 


(The analysis of variance) 


In Chapter 8 several methods were presented by which it is possible 
to make some objective assessment of the validity of the differences 
between two sample mean values. Although problems of this sort are 
relatively frequent in the geographical field, there are also many 
occasions when a comparison is required between more than two 
sets of data. Such an assessment of whether the difference between 
several sets of data is significant or not is essential before any con- 
sideration is given to what is causing the difference. The sort of 
questions and problems that such a consideration may pose, and the 
methods by which they can often be resolved, are best approached 
through a specific example, and such an approach is adopted in the 
following pages. After this initial consideration, it will then be 
possible to employ the same methods in the analysis of other problems. 


The Allocation of the Variance 


Suppose that a survey were being made of agriculture in some 
part of the country, and sampling techniques were being employed. 
A stratified random sample of farms was made with the aim of com- 
paring crop yields between the strata. These strata were defined in 
terms of the character of the soil, there being three strata related to 
fen peat soils, soils developed on Keuper marls and those on boulder 
clay. In each stratum the random sample consisted of ten farms, and 
on considering the cereal yields for these farms it was found that the 
average yield of the farms on fen peat soil was 24.3 bushels per acre,
that of those on Keuper marl was 22.2 bushels per acre, while on
boulder clay it was 21.0 bushels per acre. The problem thus presented
is whether the difference between these three samples is such that it 
would be legitimate to claim that average cereal yields in that area 
vary significantly in relation to the parent material of the soil. 

The first requirement is to set out the full sample data on which
this comparison must be based. These are tabulated below in the
three strata outlined above.


Cereal yields of ten sample farms on the following soils
(bushels per acre)

            Fen peat    Keuper marl    Boulder clay
               24            17             19
               27            25             18
               21            24             22
               22            19             24
               26            28             23
               19            21             18
               25            20             21
               29            25             19
               26            19             25
               24            24             21
Average       24.3          22.2           21.0


It is seen that values vary markedly within each stratum as well as
between the strata. Furthermore, if all the thirty values are grouped
together the combined set of data will possess a variance (σ²) which
reflects the tendency for the individual farm values to vary around an
overall mean value. This being so, any stratification of the thirty
farms into three groups, even if such stratification be done purely at
random, is likely to lead to some difference between the means of the
three strata. The question is therefore whether the difference between
the strata used (which were based on a particular characteristic, i.e.
soils) merely reflects such a random difference between any three
groups of ten items each out of these thirty items, or whether this
observed difference is significantly
greater than such a random difference would be. In other words, is 
the difference between the samples (referred to as the ‘between 
sample difference’) significantly greater than the differences that can 
be observed within each sample (referred to as the ‘within sample 
difference’)? If it is not significantly greater, then it could well be 
that the observed differences between the strata are only the result of 
chance grouping, in which case no proof of the influence of soils on 
crop yields can be obtained from this evidence. On the other hand, if 
the ‘between sample difference’ were significantly greater than the
‘within sample difference’ then it would be legitimate to assume that 
soil differences (as defined in the stratification) do lead to differences 
in crop yields. This is not to deny, of course, that many other factors 
will also affect crop yields, e.g. the quality of farm management or
the amount of capital invested in the farm, for it is these other
factors which lead to the observed differences between the sample
items in each stratum, i.e. they comprise the ‘within sample
difference’.

The first necessity is therefore to divide the variance of the total 
set of thirty values into these two component groups, i.e. to allocate 
the amount of the overall variance that is due to ‘between sample 
differences’ and that part that is due to ‘within sample differences’. 
It will be remembered that the variance is simply the average of the 
sum of the squares of the deviations from the average (p. 21), but to 
calculate this in detail can involve considerable labour. It is therefore 
desirable to adopt the expedient of an ‘assumed average’ as was done 
in the case of the short-cut calculation of the average and standard 
deviation in Chapter 3. Once again the accuracy of the assumed 
average does not affect the method of working; its sole effect is that 
the nearer the assumed average is to the actual average the smaller 
will be the numbers with which it is necessary to work. 

In the present example this assumed average can conveniently be 
taken as 22, the choice being made largely by visual assessment— 
experience will lead to a sound choice being made in most cases. 
Once this assumed average has been chosen, the original data can 
be retabulated simply as differences from this value. This has been 
done in the following table. 


          ITEMS LESS 22                SQUARES OF ‘ITEMS LESS 22’
           2    −5    −3                    4    25     9
           5     3    −4                   25     9    16
          −1     2     0                    1     4     0
           0    −3     2                    0     9     4
           4     6     1                   16    36     1
          −3    −1    −4                    9     1    16
           3    −2    −1                    9     4     1
           7     3    −3                   49     9     9
           4    −3     3                   16     9     9
           2     2    −1                    4     4     1
Total     23     2   −10       Total      133   110    66


The values on the left represent the deviations of the individual
items from the assumed average. To calculate the variance these
deviations must be squared, and this has been done on the right-hand
side in the table above under the heading ‘Squares of “Items
less 22” ’. Having prepared these two tables it is convenient to add
up each column so that the sum of each column is available for later
calculations.

To obtain the ‘overall variance’, which must then later be allocated 
to ‘between’ and ‘within’ sample differences, these individual 
squares of the deviations from the mean must be summed. This is 
simply done by adding together the totals for the three columns of 
the samples, i.e. in this case by 


133 + 110 + 66 = 309 


However, the deviations which were squared in this connection were 
deviations from an assumed mean, and there is therefore the need to 
apply a ‘correction factor’ to allow for this. To calculate this ‘cor- 
rection factor’ it is necessary to return to the tabulated ‘Items less 
22’. The totals of the three samples represent the differences which are 
left over because of the difference between the assumed average and 
the actual average. This set of differences has been incorporated in 
the overall variance of 309 already obtained above. If, therefore, the 
amount of variance that this contributes towards the 309 total can be 
obtained, a suitable correction can be applied. This can be done by 
calculating the variance of these sample totals, i.e. the totals of the 
sample columns are summed to get the total difference from the 
assumed average; this value is squared; and then this is divided by 
the number of items in all the samples 


i.e. $23 + 2 + (-10) = 15$
$15 \times 15 = 225 = T^2$ (total differences squared)
$\dfrac{225}{30} = 7.5 = \dfrac{T^2}{N}$ ($N$ = total number of items)


The resultant value of $T^2/N$, i.e. 7.5 in this case, is the contribution to


the overall variance value of 309 which is made by the difference 
between the assumed mean and the actual mean. Thus this value is 
the necessary ‘correction factor’ by which the total of the sum of the 
squares of the deviations from the assumed average (309) must be 
adjusted to get the sum of the squares from the actual average, i.e. 
309 − 7.5 = 301.5 = the sum of the squares. Finally, to obtain the
variance itself, the sum of the squares must be divided by the number
of occurrences. As these data are all samples, however, an element
of safety is introduced by using the ‘degrees of freedom’ as in
Student’s t Test, i.e. degrees of freedom = N − 1 = 30 − 1 = 29.
Dividing this number into the sum of the squares (301.5) would give
the overall variance from the actual mean, but in the calculation 
there is no need to obtain the variance itself. It is the component 
elements of the variance, i.e. the ‘sum of the squares of the deviations’ 
and the ‘degrees of freedom’, that are required. Then, instead of 
allocating the variance to ‘between’ and ‘within’ sample differences, 
these two components of the variance can be allocated instead. 
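
The ‘assumed average’ shortcut and its correction factor can be sketched in Python as follows. This is an illustration only: the function name `total_sum_of_squares` is invented here, and the thirty yields are transcribed from the table of cereal yields, with the second and eighth fen peat values taken as 27 and 29, which is what the ‘Items less 22’ column implies.

```python
fen_peat = [24, 27, 21, 22, 26, 19, 25, 29, 26, 24]
keuper_marl = [17, 25, 24, 19, 28, 21, 20, 25, 19, 24]
boulder_clay = [19, 18, 22, 24, 23, 18, 21, 19, 25, 21]
yields = fen_peat + keuper_marl + boulder_clay

def total_sum_of_squares(values, assumed_mean):
    """Sum of squared deviations from the actual mean, obtained by way of
    deviations from an assumed mean plus the correction factor T^2 / N."""
    deviations = [x - assumed_mean for x in values]
    raw = sum(d * d for d in deviations)              # 309 when assumed_mean = 22
    correction = sum(deviations) ** 2 / len(values)   # T^2 / N = 7.5 when assumed_mean = 22
    return raw - correction

ss_total = total_sum_of_squares(yields, 22)
df_total = len(yields) - 1
print(f"sum of the squares = {ss_total}, degrees of freedom = {df_total}")

# A different assumed average changes the working figures but not the answer
print(total_sum_of_squares(yields, 20))
```

Both calls give 301.5, which illustrates the earlier remark that the accuracy of the assumed average affects only the size of the numbers handled.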

At this stage, however, it is as well to summarize the calculations 
that have so far been made, and to set them out in a systematic and 
orderly manner. After tabulating ‘Items less 22’ and ‘Squares of 
“Items less 22” ’, and obtaining the totals of each column (p. 135), the
following procedure should be adopted to obtain the components of 
the overall variance. 


ITEMS LESS 22                       SQUARES OF ‘ITEMS LESS 22’
Total    23    2    −10             Total    133    110    66

T (sum of sample totals) = 15
N (total items) = 30
Correction factor = T²/N = 15²/30 = 225/30 = 7.5

Total sum of the squares = sum of sample totals − correction factor
                         = 309 − 7.5 = 301.5
Degrees of freedom = N − 1 = 30 − 1 = 29


Having thus obtained the overall picture the problem is to allocate 
the sum of the squares and the degrees of freedom to ‘between 
sample’ and ‘within sample’ groups. This is most easily done if those 
parts of these values that are due to ‘between sample’ conditions are 
first allocated. In any of the samples, if the overall average is applied 
to that sample then the total value for the sample would be that 
average multiplied by the number of items, i.e. in the present case, 
22 × 10 = 220. In the first sample, however, the total differs from
this by 23, while the second and third samples differ from it by 2 and
−10 respectively. As it is the between sample value that is being assessed,
it can be assumed that this overall deviation for the sample is evenly 
distributed between each of the occurrences. In this way the within 
sample variation is eliminated.

Therefore, whereas the sum of the squares of the differences in any
body of data would be $\sum (x - \bar{x})^2$, if the $(x - \bar{x})$ value is the same
for each item in the sample, as postulated above, then $\sum (x - \bar{x})^2
= n(x - \bar{x})^2$. The following algebraic modification can then be
made:

$$n(x - \bar{x})^2 = \left(\sqrt{n}\,(x - \bar{x})\right)^2 = \left(\frac{n(x - \bar{x})}{\sqrt{n}}\right)^2 = \left(\frac{\sum (x - \bar{x})}{\sqrt{n}}\right)^2 = \frac{\left(\sum (x - \bar{x})\right)^2}{n}$$


The value $\sum (x - \bar{x})$ is the total value of the deviations from the
average in the sample concerned, i.e. 23, 2 and −10 in the present
example. Thus the sum of the squares for the sample, with within 
sample differences eliminated, can be obtained by squaring this total 
deviation of the sample and dividing by the number of items in that 
sample. From this it follows that the total value of the sum of the 
squares resulting from between sample differences is obtained by 
summing these values for all the samples, i.e. in the present example, 
by 

$$\frac{23^2}{10} + \frac{2^2}{10} + \frac{(-10)^2}{10} = \frac{529}{10} + \frac{4}{10} + \frac{100}{10} = \frac{633}{10} = 63.3$$

As in the present case the size of the sample is the same in all three 
instances, it is easier to sum the squares of the differences first and 
then divide this value by the number of items in each sample. The 
answer obtained by either of these methods is based on differences 
from the assumed mean, and therefore the correction factor must be 
subtracted from it to obtain the true answer. Thus the ‘between
sample sum of the squares’ is 63.3 − 7.5 = 55.8. Having obtained
this value, the ‘within sample sum of the squares’ is simply the
amount that is left when this value is subtracted from the overall sum
of the squares, i.e. 301.5 − 55.8 = 245.7.
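
The allocation of the sum of the squares just described can be sketched in Python. The variable names are illustrative; the figures are the column totals of ‘Items less 22’ and the total sum of the squares already obtained.

```python
sample_totals = [23, 2, -10]   # column totals of 'Items less 22'
n_per_sample = 10              # items in each sample
ss_total = 301.5               # total sum of the squares found earlier

# Correction factor T^2 / N, needed because the totals are measured
# from the assumed average of 22 and not from the actual average
n_total = n_per_sample * len(sample_totals)
correction = sum(sample_totals) ** 2 / n_total

# 'Between sample' sum of the squares: square each sample total,
# divide by the sample size, add up, then subtract the correction factor
ss_between = sum(t * t for t in sample_totals) / n_per_sample - correction

# 'Within sample' sum of the squares is what remains of the total
ss_within = ss_total - ss_between

print(f"correction factor = {correction}")
print(f"between sample sum of the squares = {ss_between:.1f}")
print(f"within sample sum of the squares  = {ss_within:.1f}")
```

Because the totals are squared, the sign of the third total (−10) makes no difference to the between sample figure, although it does matter when forming T.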

The degrees of freedom can be allocated in the same sort of way. 
In the case of ‘between sample’ conditions the degrees of freedom 
are simply 1 less than the number of samples being considered. Here 
there are three such samples and so the ‘between sample degrees of 
freedom’ are 3 — 1 = 2. Again, with this value obtained the ‘within 
sample degrees of freedom’ are simply obtained by calculating the 
difference between this value and the overall value, i.e. 29 − 2 = 27.

Once more, however, it is as well to summarize this reasoning and
these calculations and to set them out carefully as follows:


‘Between sample’ conditions

sum of the squares:
  = (1/n)(a² + b² + c²) − correction factor
    (where n = no. of items per sample and a, b, c are the sample
    totals of differences)
  = (23² + 2² + 10²)/10 − 7.5
  = (529 + 4 + 100)/10 − 7.5
  = 633/10 − 7.5 = 55.8

degrees of freedom:
  = no. of samples − 1
  = 3 − 1 = 2

‘Within sample’ conditions

sum of the squares:
  = total sum of squares − ‘between sample’ sum of squares
  = 301.5 − 55.8 = 245.7

degrees of freedom:
  = total degrees of freedom − ‘between sample’ degrees of freedom
  = 29 − 2 = 27


Snedecor’s Variance Ratio Test 


In this way the overall sum of the squares and degrees of freedom 
(and thus the overall variance) have been allocated to the two 
categories as regards origin. Part of the overall variance was produced 
by differences between the samples, this amount of variance being 
obtained by dividing the appropriate sum of the squares by the 
equivalent degrees of freedom, i.e. 55.8/2 = 27.9. As these are samples
only, this is known as the ‘variance estimate’. Equally the other part
of the overall variance was produced by differences within the samples,
the division of sum of the squares by degrees of freedom here being
245.7/27 = 9.1 = ‘variance estimate’. Again, tabulation allows these
features to be seen more clearly. 


Source of              Sum of     Degrees of    Variance
variance               squares    freedom       estimate
(a)                    (b)        (c)           (b/c)

(i) between sample      55.8         2           27.9

(ii) within sample     245.7        27            9.1


These estimates of the variance of ‘between sample’ and ‘within 
sample’ conditions must now be compared. The purpose of such a
comparison is to see whether these variance estimates are so much
alike that the differences between the samples simply reflect the 
differences within the samples, i.e. that no significant difference 
between the sample means can be assumed; or whether they are 
sufficiently dissimilar for a significant difference between the samples 
to be accepted. In making such a comparison it is desirable to 
assume what is called a ‘null hypothesis’. By this is meant that one 
assumes that no significant difference exists between the samples, 
and that therefore the two variance estimates are not significantly 
different, and then tests to see the probability that such an assump- 
tion is justified. A direct comparison of the two variance estimates of 
27.9 and 9.1 is not possible, however, because of the markedly different
number of occurrences on which they are based, i.e. the
‘degrees of freedom’ are different in the two cases. To overcome
this difficulty a test known as ‘Snedecor’s Variance Ratio Test’ is
applied. This consists of a simple ratio which gives a value called
‘Snedecor’s F’ which is then referred to tables which indicate the
probability that the assumed null hypothesis is correct. This ratio is 
calculated as follows: 


$$\text{Snedecor's } F = \frac{\text{greater variance estimate}}{\text{lesser variance estimate}} = \frac{27.9}{9.1} = 3.07$$

The tables to which this value is referred are known as the ‘per- 
centage points of the F-distribution’ and these are needed at least for 
the 5% and 1% levels (2½% and 0.1% are also useful at times). These
tables (of which versions are given in Table XIX) indicate the per- 
centage probability that the difference between the samples could 
have occurred ‘by chance’, i.e. by the random grouping of data into 
three sets of ten values. Thus if the F value falls into the 5% prob- 
ability range then the differences between the samples would occur 
by chance only once in 20 such random groupings. In the other 19 
random groupings differences between the random samples would be 
less than those observed in the data under consideration. In other 
words, a difference of this order, in which the F value falls in the 5%
probability range, is one that is probably significant. On the other 
hand, if the F value falls in the 1% probability range it means that a 
difference of the order of the observed one will occur by random 
grouping of the data into three ten-item samples only once in a
hundred times, i.e. there is a 99% probability that a difference of that
order is not the result of chance grouping but is rather the result of 
some significant difference between the groupings chosen. 
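
The variance estimates and the ratio itself are simple to compute, as the following Python sketch shows. The function name `variance_ratio` is an illustrative choice; the inputs are the sums of the squares and degrees of freedom allocated above.

```python
def variance_ratio(ss_between, df_between, ss_within, df_within):
    """Return the two variance estimates and Snedecor's F, taken as the
    greater variance estimate divided by the lesser."""
    est_between = ss_between / df_between
    est_within = ss_within / df_within
    greater = max(est_between, est_within)
    lesser = min(est_between, est_within)
    return est_between, est_within, greater / lesser

# Cereal yield example: 55.8 on 2 d.f. between samples, 245.7 on 27 d.f. within
est_b, est_w, f_value = variance_ratio(55.8, 2, 245.7, 27)
print(f"between sample variance estimate = {est_b:.1f}")
print(f"within sample variance estimate  = {est_w:.1f}")
print(f"Snedecor's F = {f_value:.2f}")
```

The resulting F then has to be referred to tables of the F-distribution, using the degrees of freedom of the greater and lesser variance estimates.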

Table XIX

Percentage points of the F-distribution

F = greater variance estimate / lesser variance estimate


5% Level of Variance Ratio

Degrees of
freedom of        Number of degrees of freedom of greater variance estimate
lesser variance
estimate             1       2       3       4       5      10      20       ∞

      1            161     200     216     225     230     242     248     254
      2           18.5    19.0    19.2    19.2    19.3    19.4    19.4    19.5
      3           10.1     9.6     9.3     9.1     9.0     8.8     8.7     8.5
      4            7.7     6.9     6.6     6.4     6.3     6.0     5.8     5.6
      5            6.6     5.8     5.4     5.2     5.0     4.7     4.6     4.4
     10            5.0     4.1     3.7     3.5     3.3     3.0     2.8     2.5
     20            4.3     3.5     3.1     2.9     2.7     2.3     2.1     1.8
      ∞            3.8     3.0     2.6     2.4     2.2     1.8     1.6     1.0


1% Level of Variance Ratio

Degrees of
freedom of        Number of degrees of freedom of greater variance estimate
lesser variance
estimate             1       2       3       4       5      10      20       ∞

      1          4,100   5,000   5,400   5,600   5,800   6,000   6,200   6,400
      2             98      99      99      99      99      99      99      99
      3             34      31      29      29      28      27      27      26
      4             21      18      17      16      16      15      14      13
      5             16      13      12      11      11      10     9.6     9.0
     10             10     7.6     6.6     6.0     5.6     4.8     4.4     3.9
     20            8.1     5.8     4.9     4.4     4.1     3.4     2.9     2.4
      ∞            6.6     4.6     3.8     3.3     3.0     2.3     1.9     1.0


For full details see: D. V. Lindley and J. C. P. Miller, Cambridge Elementary
Statistical Tables, Cambridge, 1953 (Table 7).

To return to the example under study, it was seen that F = 3.07.
It is then necessary to locate this value on the appropriate part of 
Table XIX. If the part for the 1% level is first considered it will be 
seen that at the top, from left to right, are set out the number of 
degrees of freedom of the greater variance estimate, while on the left, 
from top to bottom, are the number of degrees of freedom of the 
lesser variance estimate. By referring to p. 139 it will be seen that in
the present case the former is 2 while the latter is 27. By setting the
latter between 20 and infinity it is seen that the appropriate F number
should be within the limits 4.6 to 5.8. On the 5% level part of Table
XIX the same reference to these degrees of freedom shows that the
appropriate F number should be within 3.0 and 3.5. As the actual
value for the present example was 3.07 it obviously fits the latter
case rather than the former. Thus it can be said that the differences
with which the problem was concerned, while not being significant
at the 1% level, were nevertheless significant at the 5% level, i.e.
there is a probably significant difference between them. So, as the
difference in cereal yields between these three samples of ten farms
on fen peat (24.3 bushels per acre), ten on marl (22.2 bushels per
acre) and ten on clay (21.0 bushels per acre) is probably significant, it
would be justified to assume a soil/cereal yield relationship for the
area concerned, at least until evidence proved otherwise. It would
not be wise to be too dogmatic about this, however, nor to press such
conclusions too far, for the probability was only at the 5% level and
not at the 1% or 0.1% levels of greater certainty.
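
The reading of Table XIX described above amounts to finding the two tabulated rows that bracket the degrees of freedom of the lesser variance estimate. A minimal Python sketch of that lookup follows; the dictionaries hold only the column for 2 degrees of freedom of the greater variance estimate, transcribed from Table XIX, and the function name `bracket` is invented for the illustration. The sketch reports the bracket and does not interpolate, so a calculated F that falls inside the bracket is a borderline case, and the fuller tables cited under Table XIX would give the value for the exact degrees of freedom.

```python
import math

# Column for 2 degrees of freedom of the greater variance estimate,
# keyed by degrees of freedom of the lesser variance estimate
# (values transcribed from Table XIX)
F_5PCT_COL2 = {1: 200, 2: 19.0, 3: 9.6, 4: 6.9, 5: 5.8,
               10: 4.1, 20: 3.5, math.inf: 3.0}
F_1PCT_COL2 = {1: 5000, 2: 99, 3: 31, 4: 18, 5: 13,
               10: 7.6, 20: 5.8, math.inf: 4.6}

def bracket(column, df_lesser):
    """Return (low, high) tabulated F values from the two rows of the
    table that bracket df_lesser."""
    rows = sorted(column)
    for lower_df, upper_df in zip(rows, rows[1:]):
        if lower_df <= df_lesser <= upper_df:
            # Tabulated F falls as the degrees of freedom rise
            return column[upper_df], column[lower_df]
    raise ValueError("df_lesser must be at least 1")

f_value = 3.07
print("5% level bracket for 27 d.f.:", bracket(F_5PCT_COL2, 27))
print("1% level bracket for 27 d.f.:", bracket(F_1PCT_COL2, 27))
```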


Tabulated Example 


This method of assessing the validity of the differences between 
several sets of data, when the data are grouped (or stratified) accord- 
ing to some possible causal factor, is one which can be widely 
applied in terms of geographical problems. It is, undoubtedly, a 
somewhat complicated and involved technique, though it is by no 
means as involved as it seems when it is presented for the first time. 
It is therefore essential that this ‘analysis of variance’ now be 
applied to another problem to familiarize the reader with the routine 
that needs to be followed. Once the nature of the problem has 
been posed, the working out of the necessary calculations will be 
presented in a semi-tabular form so that it can be more readily
appreciated simply as a technique or routine. The reasoning behind
this routine has already been presented in outline in connection with 
the previous example, and needs to be repeated in only one or two 
particulars. 

Suppose that a study were being made of the percentage of land 
under woodland in a series of medieval vills. As a large area may be 
involved it is decided to take a sample of these vills and to consider 
from this sample the extent to which differences in woodland per- 
centages vary in relation to possible causal factors. From other 
evidence and from experience it may be apparent that one likely 
factor leading to such differences may be the altitude at which the 
vills lie. The data are therefore stratified in terms of altitude, and 
random samples—each totalling ten in number—are taken within 
four height ranges, i.e. 0-300 ft., 300-600 ft., 600-900 ft., 900-1,200 
ft. As can be seen from the data presented below (Stage I) differences 
in the average percentages for these four ten-item samples appeared, 
while equally there were marked variations within each ten-item 
sample. Here is clearly a case where the analysis of variance, 
separating the ‘between sample’ and ‘within sample’ variances, is
a suitable method of analysis, the aim being to assess the degree of 
significance of the differences between the sample averages. In other 
words, it is to find out whether or not altitude did exercise a 
significant influence on the percentage of the land in the vill that was 
under woodland. 


STAGE I   Altitude of vills for which woodland percentages are given

            0-300 ft.   300-600 ft.   600-900 ft.   900-1,200 ft.
               (a)          (b)           (c)            (d)
                %            %             %              %
               42           34            33             44
               45           43            40             39
               38           34            48             27
               30           39            45             34
               34           37            39             40
               39           39            42             42
               22           39            47             29
               32           47            48             42
               32           38            38             40
               34           45            44             36
Average       34.8         39.5          42.4           37.3

Once the values are set out in this way the first step is to take an
assumed average and retabulate the data minus this value. In the 
present case, to illustrate that the average assumed does not have to 
be close to the actual average, let this assumed value be 35. The 
retabulation will thus be on the basis of ‘items less 35’ while the 
squaring of the data for the calculation of the sum of the squares of 
the deviations will be ‘squares of “items less 35” ’, as follows (Stage II).


STAGE II

            Items less 35                 Squares of ‘items less 35’
          a     b     c     d              a     b     c     d
          7    −1    −2     9             49     1     4    81
         10     8     5     4            100    64    25    16
          3    −1    13    −8              9     1   169    64
         −5     4    10    −1             25    16   100     1
         −1     2     4     5              1     4    16    25
          4     4     7     7             16    16    49    49
        −13     4    12    −6            169    16   144    36
         −3    12    13     7              9   144   169    49
         −3     3     3     5              9     9     9    25
         −1    10     9     1              1   100    81     1
Total    −2    45    74    23    Total   388   371   766   347


From these retabulated data the following calculations must first
be made (Stage III).


STAGE III

T (sum of sample totals) = −2 + 45 + 74 + 23 = 140
N (total items) = 10 × 4 = 40
C.F. (correction factor because of ‘assumed average’)
  = T²/N = 140²/40 = 490

Total sum of the squares
  = sum of sample totals − correction factor
  = (388 + 371 + 766 + 347) − 490
  = 1,382

Degrees of freedom = number of items less one = N − 1 = 40 − 1 = 39


From these simple calculations it is now possible to proceed to the 
fourth stage of the analysis in which the total sum of the squares and 
the degrees of freedom are allocated to ‘between sample’ and ‘within 
sample’ conditions respectively.


STAGE IV

‘Between sample’ conditions

sum of squares = (1/n)(a² + b² + c² + d²) − C.F.
  where n = no. of items in each sample, and a to d = totals of each sample,
  i.e. (1/10)((−2)² + 45² + 74² + 23²) − 490
     = (4 + 2,025 + 5,476 + 529)/10 − 490
     = 8,034/10 − 490 = 313

degrees of freedom = number of samples less one = 4 − 1 = 3

‘Within sample’ conditions

sum of squares = total sum of squares − ‘between sample’ sum of squares
  = 1,382 − 313 = 1,069

degrees of freedom = total degrees of freedom − ‘between sample’
  degrees of freedom = 39 − 3 = 36


These values can now be set out in the fifth stage of the analysis, 
and Snedecor’s F Test applied to them. 


STAGE V

Source of              Sum of     Degrees of    Variance
variance               squares    freedom       estimate
(a)                    (b)        (c)           (b/c)
(i) between sample       313         3          104.3
(ii) within sample     1,069        36           29.7

Snedecor’s F = greater variance estimate / lesser variance estimate
             = 104.3 / 29.7 = 3.51


In the sixth and final stage of the analysis this F value must be 
checked against Table XIX. With 3 degrees of freedom for the 
greater variance estimate and 36 for the lesser variance estimate, the 
appropriate F value at the 5% level is between 2.6 and 3.1, while at
the 1% level the limits are 3.8 to 4.9. The calculated value for the
present example is 3.51 which therefore falls between the 1% and 5%
levels of probability. This means that it is unlikely that differences as
great as this would occur ‘by chance’ and therefore the observed
differences are at least ‘probably significant’ and possibly even truly
significant. Thus it would be reasonable to assume that the differences
in the percentage of land of the vills under woodland between the
four samples chosen reflect the basis on which the samples were
chosen, i.e. that altitudinal differences in the location of the vills
influenced the amount of woodland in the vills.
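
The six stages can be gathered into one short Python routine, sketched below. It is an illustration under stated assumptions: the function name `one_way_anova` is invented here, the woodland percentages are transcribed from Stage I, and the routine works from the raw values, which is equivalent to taking an assumed average of zero. It returns F as the between sample estimate divided by the within sample estimate, which coincides with greater over lesser in this example.

```python
def one_way_anova(samples):
    """Analysis of variance for a list of samples (each a list of numbers).

    Returns (ss_between, df_between, ss_within, df_within, F).
    """
    all_values = [x for sample in samples for x in sample]
    n_total = len(all_values)
    # Correction factor T^2 / N, here with T the grand total of raw values
    correction = sum(all_values) ** 2 / n_total
    ss_total = sum(x * x for x in all_values) - correction
    ss_between = sum(sum(s) ** 2 / len(s) for s in samples) - correction
    ss_within = ss_total - ss_between
    df_between = len(samples) - 1
    df_within = (n_total - 1) - df_between
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, df_between, ss_within, df_within, f_value

vills = [
    [42, 45, 38, 30, 34, 39, 22, 32, 32, 34],  # (a) 0-300 ft.
    [34, 43, 34, 39, 37, 39, 39, 47, 38, 45],  # (b) 300-600 ft.
    [33, 40, 48, 45, 39, 42, 47, 48, 38, 44],  # (c) 600-900 ft.
    [44, 39, 27, 34, 40, 42, 29, 42, 40, 36],  # (d) 900-1,200 ft.
]

ss_b, df_b, ss_w, df_w, f_value = one_way_anova(vills)
print(f"between sample: sum of squares = {ss_b:.1f}, d.f. = {df_b}")
print(f"within sample:  sum of squares = {ss_w:.1f}, d.f. = {df_w}")
print(f"Snedecor's F = {f_value:.2f}")
```

Run on these data the routine gives F of about 3.52 rather than 3.51, the small difference arising because the worked example rounds the sums of the squares to whole numbers before dividing.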


Shorter Method of Assessment 


As with many other methods, however, a shorter version of the 
analysis may well provide an answer within a reasonable degree of 
accuracy. At least such shorter methods may indicate whether the 
full analysis is desirable or worthwhile, while at times it may provide 
almost as useful an answer as the full method itself. In the analysis 
of variance the shorter version simply takes the two samples with
the most extreme mean values, i.e. the one with the lowest mean
value and the one with the highest mean value. In the present
example these would be samples (a) and (c), the other two samples
(b and d) being ignored for this purpose. The whole analysis can
then be carried out on the basis of these two samples only. The 
resultant calculations are presented below. 


            Items less 35            Squares of ‘items less 35’
              (a)     (c)                 (a)     (c)
Total         −2      74      Total       388     766

T = 74 − 2 = 72
N = 10 × 2 = 20

$$C = \frac{T^2}{N} = \frac{5{,}184}{20} = 259$$

Total sum of squares = 388 + 766 − 259 = 895
Degrees of freedom = 20 − 1 = 19

‘Between sample’ sum of squares

$$= \frac{(-2)^2 + (74)^2}{10} - C = \frac{4 + 5{,}476}{10} - 259 = 289$$

degrees of freedom = no. of samples minus one = 2 − 1 = 1

‘Within sample’ sum of squares = 895 − 289 = 606
degrees of freedom = 19 − 1 = 18

Source of variance     Sum of squares   Degrees of freedom   Variance estimate
‘Between sample’            289                 1                  289
‘Within sample’             606                18                   33.7

$$\text{Snedecor's } F = \frac{289}{33.7} = 8.58$$

If this value is referred to Table XIX it will be found that it fits
almost exactly the requirements specified for the 1% level conditions,
i.e. if the degrees of freedom are 1 for the greater variance estimate
and 18 for the lesser, the F number should lie between 8.1 and 10,
though nearer the former.
This shortened version thus gives a similar answer to the one obtained 
by the full version. On the other hand, it must be stressed that this 
shortened version presents conditions in a biased way, tending to 
overstress the probability of a significant difference. It is therefore 
necessary to be cautious in interpreting the results of the shortened 
version. To be on the safe side, if the shorter method indicates a 1% 
level of significance it is as well to consider it as indicating that at 
least the 5% level of significance applies, although the 1% level may 
in fact result from the full analysis as well as from the shorter 
method. If, however, the shorter version only returns a 5% level of 
probability it is desirable to apply the full analysis, for there is every 
likelihood (though not an absolute certainty) that the full range of 
differences will not reach the ‘probably significant’ (i.e. 5%) level. 
As in all such short cuts, this one must therefore be used carefully, 
be interpreted intelligently and cautiously, and be followed by a full 
analysis if there is the least possibility of a false answer being obtained 
by the shortening of the calculations. 
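
The shorter method needs only the totals of the two extreme samples, as the following sketch suggests. The function name and argument names are hypothetical, and the sketch assumes samples of equal size, as in the woodland example.

```python
def f_from_sample_totals(totals, totals_of_squares, items_per_sample):
    """Analysis of variance from per-sample totals (equal sample sizes).

    totals            -- sum of the 'items less assumed mean' for each sample
    totals_of_squares -- sum of the squared items for each sample
    Returns (F, between degrees of freedom, within degrees of freedom),
    where F is the 'between sample' estimate over the 'within sample' one.
    """
    n_samples = len(totals)
    n_items = n_samples * items_per_sample
    correction = sum(totals) ** 2 / n_items                  # C = T^2 / N
    total_ss = sum(totals_of_squares) - correction
    between_ss = sum(t ** 2 for t in totals) / items_per_sample - correction
    within_ss = total_ss - between_ss
    between_df = n_samples - 1
    within_df = (n_items - 1) - between_df
    f_value = (between_ss / between_df) / (within_ss / within_df)
    return f_value, between_df, within_df


# Samples (a) and (c) of the woodland example, items less 35.
f_value, between_df, within_df = f_from_sample_totals(
    totals=[-2, 74], totals_of_squares=[388, 766], items_per_sample=10
)
print(round(f_value, 2), between_df, within_df)
```

Because only the two most extreme samples are used, the value returned is biased towards significance, and it should be read with the caution urged above.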


Use with Samples of Different Sizes 


A third general example may help both to reinforce the under- 
standing of the technique, and to illustrate further the sort of problem 
with which it is possible for this analysis to deal. Moreover, one or 
two minor differences can also be introduced. Suppose that in a 
rather undeveloped part of the world it is decided to try to assess by 
sampling methods the population per village. In the area under 
study it is known that four different tribal groups exist, and it is 
suspected that the size of the village unit may vary with the tribe. 
The sample is stratified proportionately to the known number of 
villages occupied by these four different tribes, with the results set 
out overleaf. As can be seen, not only does the average population 
per village vary with the tribe, but also the size of the sample differs 
from one tribe to the other. The question is therefore whether or
not the tribal groupings significantly affect the average population
per village.

               Tribe A   Tribe B   Tribe C   Tribe D
                 150       150       300       350
                 550       500       550       800
                 250       400       400       300
                 150       350       250       350
                           250       350       700
                           450       550       550
                                               450
                                               500
Average
population
per village      275       350       400       500


The analysis is carried out in the same way as before, except that 
the difference in the size of the samples must be remembered. The 
assumed mean can in this case be 400, and the values retabulated as
follows:


            Items less 400                   Squares of ‘items less 400’
          A      B      C      D            A         B         C         D
       −250   −250   −100    −50        62,500    62,500    10,000     2,500
        150    100    150    400        22,500    10,000    22,500   160,000
       −150      0      0   −100        22,500         0         0    10,000
       −250    −50   −150    −50        62,500     2,500    22,500     2,500
              −150    −50    300                  22,500     2,500    90,000
                50    150    150                   2,500    22,500    22,500
                              50                                       2,500
                             100                                      10,000

Total  −500   −300      0    800  Total 170,000   100,000    80,000   300,000


T = 800 − 800 = 0
N = 4 + 6 + 6 + 8 = 24

$$C = \frac{T^2}{N} = \frac{0^2}{24} = 0$$

(i.e. the actual mean is also 400)

Total sum of squares = 650,000 − 0 = 650,000
Degrees of freedom = 24 − 1 = 23

‘Between sample’ sum of squares

$$= \frac{(-500)^2}{4} + \frac{(-300)^2}{6} + \frac{(0)^2}{6} + \frac{(800)^2}{8} - C$$

= 62,500 + 15,000 + 0 + 80,000 − 0
= 157,500
degrees of freedom = 4 − 1 = 3

‘Within sample’ sum of squares = 650,000 − 157,500 = 492,500
degrees of freedom = 23 − 3 = 20

Source of variance     Sum of squares   Degrees of freedom   Variance estimate
‘Between sample’          157,500               3                 52,500
‘Within sample’           492,500              20                 24,625

$$F = \frac{52{,}500}{24{,}625} = 2.13$$


If the requisite degrees of freedom are read on the 1% and 5% 
tables (3 degrees for the greater variance estimate and 20 degrees for 
the lesser) it will be seen that the following F values are required:

1% level         F = 4.9
5% level         F = 3.1

Observed value   F = 2.13
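
The full analysis for samples of different sizes can be sketched as follows. The function name is hypothetical; the data are the village populations tabulated above, and the only change from the equal-size case is that each sample total is squared and divided by its own sample size.

```python
def analysis_of_variance(samples, assumed_mean=0):
    """One-way analysis of variance for samples of any (possibly unequal) size."""
    shifted = [[x - assumed_mean for x in sample] for sample in samples]
    n_items = sum(len(sample) for sample in shifted)
    grand_total = sum(sum(sample) for sample in shifted)
    correction = grand_total ** 2 / n_items                      # C = T^2 / N
    total_ss = sum(x * x for sample in shifted for x in sample) - correction
    # Each sample total is squared and divided by that sample's own size.
    between_ss = sum(sum(s) ** 2 / len(s) for s in shifted) - correction
    within_ss = total_ss - between_ss
    between_df = len(samples) - 1
    within_df = (n_items - 1) - between_df
    return {
        "between_ss": between_ss, "within_ss": within_ss,
        "between_df": between_df, "within_df": within_df,
        "f": (between_ss / between_df) / (within_ss / within_df),
    }


tribes = {
    "A": [150, 550, 250, 150],
    "B": [150, 500, 400, 350, 250, 450],
    "C": [300, 550, 400, 250, 350, 550],
    "D": [350, 800, 300, 350, 700, 550, 450, 500],
}
result = analysis_of_variance(list(tribes.values()), assumed_mean=400)
print(result)

# 3.1 is the 5% value quoted in the text for 3 and 20 degrees of freedom.
significant_at_5_percent = result["f"] > 3.1
print("significant at the 5% level:", significant_at_5_percent)
```

The choice of assumed mean does not affect the result; it merely keeps the figures small when the work is done by hand.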


Thus the observed value is clearly one which indicates a ‘chance’ 
occurrence of greater than 5%, i.e. it will occur ‘by chance’ too 
frequently to allow much reliance to be placed on the significance of 
the differences between the four sample groups under study. In this 
case, therefore, despite what may appear to be quite large average 
differences between the tribal groups specified, in terms of the 
average population per village, these differences cannot legitimately 
be classed as even ‘probably significant’. The different sizes of the 
samples obviously affect this answer to some extent, and an increase 
in the sampling fraction may well lead to a rather different answer. 
The analysis of variance presents a reasonably clear-cut verdict 
however, that on the basis of the sample data given above the dif- 
ference cannot be classed as a significant one. In a problem such as 
this the shorter method could well have been employed first of all to 
see whether it was worth making the full analysis, although the 
different sizes of the samples taken make this a risky process. If the
two extreme cases (A and D) had been taken, however, an F value
of 4.12 would have been obtained, with degrees of freedom being 1
for the greater variance estimate and 10 for the lesser (the reader may
check these calculations readily from the values set out above). With
these degrees of freedom the following values are critical:

1% level         F = 10.0
5% level         F = 5.0

Observed value   F = 4.12

Thus despite the differences in the sample sizes this shorter method
still yields a verdict of the same order as, though slightly more
favourable than, that of the full method. It can be clearly seen that
whether the longer or shorter method be used, the ‘analysis of 
variance’ provides a very valuable tool by which several sets of data 
may be compared, and an objective assessment be made of the 
significance of the apparent differences between them. Many more 
complex and refined analyses can also be carried out by modifications 
of this method, but recourse must be had to more advanced texts for 
examples of these.

CHAPTER 10


THE COMPARISON OF FREQUENCY DISTRIBUTIONS


(The Chi-squared Test) 


In the two previous chapters methods have been presented by which 
it is possible to assess the statistical significance of differences 
between sample data, these differences being reflected in the sample 
mean values and also in the variances of these samples. Many cases 
occur, however, in which absolute values such as these may not be 
available although the frequency distribution of the data, based on 
some sort of grouping, can be obtained. It is therefore of value to 
consider a method by which the significance of such sample dif- 
ferences can be assessed. Moreover, all the tests of significance out- 
lined earlier assume that the body of data fits, or approximates to, 
the normal distribution curve. When the body of data is markedly 
skew, however, it is often advantageous to effect a comparison in 
terms of the frequency distribution, even if the mean and standard 
deviation values are available in absolute terms. 

The method by which such comparisons may be made is known as
the Chi-squared (χ²) Test. This is a relatively easy test to apply, but it
is essential that the data being considered are in the correct form and
that the problem is a suitable one for this method. This value, χ²,
tests whether the observed frequencies of a given phenomenon differ
significantly from the frequencies which might be expected according
to some assumed hypothesis. This assumed hypothesis must be
carefully defined and clearly thought out and understood, so that the
results can be correctly interpreted. Furthermore, the data must be
in the form of frequencies and NOT in absolute values.


The Calculation of χ²


The most effective way to understand the characteristics and 
qualities of this method is to follow through several examples, so 
that the various possible difficulties are encountered and ways of 
circumventing them are seen in a specific context. At the same time
the general approach can be outlined so that the method can be
applied in other cases. The first problem to be analysed in this way 
can be expressed as follows. Suppose that a study is being made of 
farm sites in relation to the characteristics of those sites. Over an 
area of diversified relief a sample consisting of 200 farms is made, 
these farms being grouped into several categories depending on the 
physical character of the site—alluvium, terrace, steep slopes, lime- 
stone plateau and sandstone plateau. The values in each category are 
set out below. 


                                        Types of
                          No. of        terrain as
Site                      farm sites    % of all land

1. Alluvium                   10            10
2. Terrace                   100            35
3. Steep slopes                2            10
4. Limestone plateau          38            25
5. Sandstone plateau          50            20
   Totals                    200           100


Also in this table is the amount of land of each of these five types 
expressed as a percentage of all the land in the area under study. 
Clearly the distribution of farms between these different types of site 
is partially related to the amount of land of each type—thus terraces 
are the most frequent type of terrain and also contain the greatest 
number of farms, while the two types of terrain that occur least 
frequently also have the two smallest numbers of farms. On the other 
hand, the distribution of farms would also seem to partially indicate 
some preferential choice of site between these five possibilities—thus 
both the terrace and the sandstone plateau would seem to have a 
greater number of farms than their relative areas would suggest, 
while the other three sites are all under-represented to some degree. 
In trying to find an explanation for the spatial distribution of farm 
sites in such a situation, one problem which has to be solved is the 
relative importance of the two tendencies indicated above. If the 
number of farms on a given type of site is mainly a reflection of 
the frequency with which that type of site occurs then it cannot be 
argued that the characteristics and qualities of that type of site are 
factors influencing farm sites. Conversely, if the frequency with 
which farms occur on given sites is not mainly a reflection of the
frequency with which these sites occur, then it would seem legitimate
to argue that the different sites possess characteristics and qualities 
which influence the siting of farms. Thus a causal relationship, 
possibly weak and possibly strong, may be established between types 
of terrain and the occurrence of farm sites. 

To test which of these two possibilities is most likely it is necessary 
first to set up a ‘null hypothesis’, i.e. it is necessary to postulate that 
the observed distribution of farm sites could be reasonably expected 
in the light of the proportions of different types of land or, in other 
words, that there is no significant difference between these five 
groups (or sites) as regards a preferential frequency of farm siting, 
the scattering of the farms simply representing a random distribution 
over the whole area. It is this null hypothesis that is tested by χ², and
which must be carefully and adequately posed.

This χ² value is calculated as follows. Let the ‘observed’ values
(i.e. those that actually occur) be written as O, as set out below.
Beneath these are written the values which would occur if the postulated
null hypothesis really applied to the full. These are known as
the ‘expected’ values, written as E.


                          Group 1   Group 2   Group 3   Group 4   Group 5
‘Observed’ frequency        O₁        O₂        O₃        O₄        O₅
‘Expected’ frequency        E₁        E₂        E₃        E₄        E₅


From these the χ² value is obtained by the formula

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \frac{(O_4 - E_4)^2}{E_4} + \frac{(O_5 - E_5)^2}{E_5}$$


Thus for each category the amount by which the observed frequency 
differs from the expected frequency is squared, and then related to 
the expected frequency itself. This is akin to the procedure for 
calculating the variance and then relating it (or the standard devia- 
tion) to the mean value from which actual conditions varied (see 
p. 39). When these figures for each category are summed this gives 
the total sum of the squares of the difference between observed and
expected values. Each squared difference is divided by the appropriate
expected value because the value from which the observed data deviate
is different in each group, in contrast with the deviations from the
mean just mentioned. The division by E is therefore to eliminate this
variable so that the summing of the separate squares becomes legitimate.
Once the χ² value is obtained in this way, it can be referred to the
appropriate table or graph and read off against the degrees of
freedom. These, as in techniques discussed earlier, are obtained by
subtracting 1 from the number of occurrences, i.e. the number of
groups or categories. This table or graph will yield a value which
gives the percentage probability that differences as great as those
observed would occur by chance if the null hypothesis were correct.

In the actual example which was commenced earlier, the observed 
values are available, while the expected values (based on the null 
hypothesis on p. 153) would be proportional to the amount of land 
in each category. Thus, as 10% of the land is alluvium it is to be 
expected, on the basis of the null hypothesis, that 20 out of 200 farms 
would be on alluvium. These expected values are given below with 
the observed values. 


                  1      2      3      4      5
O           =    10    100      2     38     50
E           =    20     70     20     50     40
O − E       =   −10     30    −18    −12     10
(O − E)²    =   100    900    324    144    100
(O − E)²/E  =   5.0   12.9   16.2    2.9    2.5


For each category it is a simple matter to calculate (O − E),
(O − E)² and (O − E)²/E, as has been tabulated above. These can then
be entered into the formula, i.e.

$$\chi^2 = 5.0 + 12.9 + 16.2 + 2.9 + 2.5 = 39.5$$

This value reflects the sum of the squares of the deviations of observed
conditions from the expected conditions.

As for the degrees of freedom, these are obtained by N − 1
= 5 − 1 = 4. By referring to the graph in Fig. 28 it will be seen that
a χ² value of 39.5 with 4 degrees of freedom yields a probability value
of less than 0.1%. This means that the null hypothesis on which the
comparison was based would produce differences as great as this ‘by
chance’ less than one time in a thousand, i.e. there is a 99.9%
probability that the observed differences are not the result of a
chance occurrence within the null hypothesis. This means that the
percentage probability that farm sites are distributed between the
different sites in relation to the frequency with which these sites
occur is very small indeed, and it would seem almost certain that the
characteristics and qualities of the type of terrain do significantly
affect the frequency with which farm sites occur.

Figure 28. Graph for the Chi-squared Test (based on data in D. V. Lindley and
J. C. P. Miller, Cambridge Elementary Statistical Tables, Table 5)
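
The farm-site calculation can be sketched in Python as follows. The function and variable names are hypothetical, and the 0.1% value of 18.47 for 4 degrees of freedom is an assumption taken from standard χ² tables, not a reading from Fig. 28. Carried through without rounding each term, the total comes to about 39.4; the 39.5 of the text arises from summing the rounded terms.

```python
def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


sites = ["Alluvium", "Terrace", "Steep slopes",
         "Limestone plateau", "Sandstone plateau"]
observed = [10, 100, 2, 38, 50]        # farm sites in the sample
land_percent = [10, 35, 10, 25, 20]    # each terrain as % of all land

# Null hypothesis: farms are shared out in proportion to the land available.
total_farms = sum(observed)
expected = [total_farms * p / 100 for p in land_percent]

chi2 = chi_squared(observed, expected)
degrees_of_freedom = len(observed) - 1

# Assumption: 18.47 is the 0.1% value for 4 degrees of freedom in standard
# chi-squared tables; it is not read from Fig. 28.
CRITICAL_0_1_PERCENT_4_DF = 18.47

for site, o, e in zip(sites, observed, expected):
    print(f"{site:18s} O={o:4d} E={e:5.1f} (O-E)^2/E={(o - e) ** 2 / e:5.2f}")
print(f"chi-squared = {chi2:.1f} with {degrees_of_freedom} degrees of freedom")
print("beyond the 0.1% level:", chi2 > CRITICAL_0_1_PERCENT_4_DF)
```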

This apparently inverted way of approaching the problem is 
necessitated by the characteristics of the method itself. It is always 
necessary to set up a suitable null hypothesis on the basis of which
the expected values can be obtained. The aim is then to assess the
probability that the observed conditions are a reflection of the 
expected ones. If this is largely true then probabilities of between 
95% and 100% may be obtained. If, on the other hand, this is 
largely untrue, then low probability values occur. If these are 5% or 
less, then it is justifiable to say that the inverse of the null hypothesis 
is probably true, while if the value is 1% or less then the likelihood 
of this inverse relationship being true is very great. 

A further illustration of the application of the χ² Test in this its
simpler form may help to show more clearly both how the null 
hypothesis can be framed in different circumstances and also the 
various problems that can be analysed in this way. For example, let 
it be assumed that a given industrial area exports its products by 
four types of route—railway, road, sea and canal. A sample of 120 
items was taken, each item being a unit valued at the same amount. It 
was found that of these units, 40 were sent by railway, 35 were sent by 
road, 30 by sea and 15 by canal. The problem raised is whether these 
differences are likely to be the result of chance, or whether they reflect 
some valid difference between these types of routes as export media. 

In this case the null hypothesis can be framed in the terms that 
there is no difference between these media as regards their importance 
for exports. This being so, it could be expected that the exports would 
be shared equally between them, i.e. that 30 of these units would be 
sent by each method. This can thus be entered in the tabulation below
together with the observed values and the components of χ² can be
evaluated.

                          ROUTES
                      (units exported)
               Railway    Road     Sea    Canal

O           =     40       35      30      15
E           =     30       30      30      30
O − E       =     10        5       0     −15
(O − E)²    =    100       25       0     225
(O − E)²/E  =   3.33     0.83       0     7.5


Degrees of freedom = N − 1 = 4 − 1 = 3


From these data χ² can be obtained by

$$\chi^2 = 3.33 + 0.83 + 0 + 7.5 = 11.66$$

From Fig. 28 it can be seen that with this value of χ² and 3 degrees
of freedom the probability that such a difference as the observed one
could have occurred by chance is about 1%. With so low a probability
of such differences arising under the null hypothesis, it is justified
to assume that a significant difference does exist between the four
possibilities, on the basis of the available evidence.
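
Where the null hypothesis is one of equal shares, the expected value is simply the sample total divided by the number of categories. A brief sketch, with hypothetical variable names, follows; the unrounded total is 11.67, the 11.66 of the text being the sum of the rounded components.

```python
observed = {"Railway": 40, "Road": 35, "Sea": 30, "Canal": 15}

# Null hypothesis: no route is preferred, so every route expects an equal share.
expected_each = sum(observed.values()) / len(observed)

# Component of chi-squared contributed by each route.
components = {route: (o - expected_each) ** 2 / expected_each
              for route, o in observed.items()}
chi2 = sum(components.values())
degrees_of_freedom = len(observed) - 1

for route, value in components.items():
    print(f"{route:8s} {value:5.2f}")
print(f"chi-squared = {chi2:.2f} with {degrees_of_freedom} degrees of freedom")
```

The canal contributes by far the largest component, which shows where the departure from equal shares chiefly lies.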


Influence of Size of Sample 


The limitations imposed by ‘the available evidence’ are the same 
as those resulting from the ‘size of the sample’ in examples in earlier 
chapters. In the χ² Test, too, changes in the size of the sample can
affect the resultant conclusions, even if the proportion of occurrences
within any one category remains unaltered. Thus suppose that in a
survey of the agriculture of an area of diversified relief, where upland 
is interdigitated with lowland, a stratified random sample were made 
of the farms, using the methods outlined in Chapter 7. The strata 
were two in number, these being upland and lowland groups, and 
the size of the resultant sample was 50 farms, 30 in the lowlands and 
20 in the uplands. These 50 farms were then classified as to whether 
or not dairying entered into their economy, and it was found that 16 
lowland farms and 4 upland farms included some dairying activity. 
The question that is then posed is whether it is justifiable to assume 
that there is a difference in the role of dairying as between upland 
and lowland farms. The null hypothesis must clearly be that there is 
no difference between upland and lowland in this connection, and 
the probability that the observed differences are the result of chance 
must be assessed by the χ² Test. The observed values are set out
below, while the expected values on this basis are obtained by sharing 
the 20 farms in the ratio of 3 : 2 between lowland and upland groups 
respectively. 


                 No. of farms    No. of farms with dairying
                   visited          O      E    O − E   (O − E)²   (O − E)²/E
Lowland farms        30            16     12      4        16         1.33
Upland farms         20             4      8     −4        16         2.0

$$\chi^2 = 1.33 + 2.0 = 3.33$$

Degrees of freedom = 2 − 1 = 1

From Fig. 28 it will be seen that with these values for χ² and
degrees of freedom the observed differences could occur by chance
with a percentage probability of between 5% and 10% even if there
were no difference between lowland and upland in terms of the role
of dairying in the farm economy. Thus as the χ² value lies on the
wrong side of the 5% probability line, the available evidence does
not really justify a statement that these two groups of farms do reflect
different conditions. As the χ² value is not too far removed from the
5% line, however, it would be worth increasing the size of the
sample to see whether this can clarify the issue.

It may therefore be decided to increase the size of the sample by 
50%, so that 45 lowland farms and 30 upland farms were studied. If 
the same proportions of each were found to include dairying as in the 
earlier smaller sample, this would mean that 24 lowland farms and 
6 upland ones would fall into this category. These, and the expected 
values obtained by sharing the 30 dairying farms in a ratio of 3:2 
again, are given below and the usual calculations made. 


                 No. of farms    No. of farms with dairying
                   visited          O      E    O − E   (O − E)²   (O − E)²/E
Lowland farms        45            24     18      6        36          2
Upland farms         30             6     12     −6        36          3

$$\chi^2 = 2 + 3 = 5$$

Degrees of freedom = 2 − 1 = 1


By referring these values, based on a larger sample, to Fig. 28, it 
will be seen that the probability of such a difference being a chance 
occurrence has been reduced to 2.5%. It is therefore highly probable
(97.5%) that the observed difference is not a chance occurrence, that
the null hypothesis that there is no difference between upland and
lowland farms is not justified, and that therefore the observed
difference between lowland and upland farms is ‘probably significant’
at the 2.5% level. Thus the larger size of the sample, with
the proportions remaining the same, clarifies the position. It
must be realized, however, that in reality the new larger sample
will almost certainly yield proportions of farms with dairying which
differ from those of the first sample. They are more likely to be
nearer the true proportions, however, because of the larger size of 
the sample. 
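
The effect of the size of the sample can be seen by repeating the calculation for both samples, as in the sketch below. The function name is hypothetical, and the sketch follows the procedure of the text in sharing the dairying farms between the strata in the ratio of the farms visited.

```python
def dairying_chi_squared(visited, with_dairying):
    """Chi-squared for dairying farms shared between strata.

    visited       -- farms visited in each stratum (fixes the expected ratio)
    with_dairying -- farms with dairying observed in each stratum
    """
    total_visited = sum(visited)
    total_dairying = sum(with_dairying)
    expected = [total_dairying * v / total_visited for v in visited]
    return sum((o - e) ** 2 / e for o, e in zip(with_dairying, expected))


# Order of strata: lowland, upland.
small = dairying_chi_squared(visited=[30, 20], with_dairying=[16, 4])
large = dairying_chi_squared(visited=[45, 30], with_dairying=[24, 6])

print(round(small, 2), round(large, 2), round(large / small, 2))
```

With the proportions held constant, every observed and expected value is multiplied by the same factor, and so therefore is χ²: a sample half as large again yields a χ² half as large again.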


The Comparison of Two or More Variables 


So far in this consideration of the χ² Test only simple examples
have been used, in which there has been in each case only one set of 
variable conditions—the frequency of farm sites upon different 
terrains; the frequency of exports along different routes; and the 
frequency of dairying between different groups of farms. In many 
problems, on the other hand, it may be desirable to compare two or 
more different sets of variable conditions, where, for example, two 
sets of data have different frequency distributions and it is necessary 
to ascertain whether these differences in frequency distributions are 
statistically significant or not. 

A specific problem may help to clarify the matter. Suppose that a 
study is being made of the distribution of woodland over an area. It 
is seen that it is distributed very irregularly and the study is aimed 
at presenting some valid explanation of this irregular distribution. 
The woodland distribution is therefore compared with the distribu- 
tion of various possible causative factors, amongst which is the form 
of land tenure. On consideration of the individual land holdings, 
which totalled 300 in all, it was seen that 90 of them were private 
estates while 210 were tenant farms. The obtaining of exact acreages 
and percentages of the holdings under woodland not proving possible 
for one reason or another, the holdings were instead grouped as to 
whether less than 10%, 10-20% or more than 20% of the holdings 
was under woodland. The values obtained are set out below, these 
representing the ‘observed’ frequencies to be used in the χ² Test.
Furthermore, the differences between the frequency distributions 
for the two types of land holding can also be clearly seen. 


                        No. of holdings with given
                        % of holding under woodland
                     >20%     10-20%     <10%     Totals

Private estates       30        45        15        90
Tenant farms          30       105        75       210
Totals                60       150        90       300

To obtain the ‘expected’ frequencies so that the χ² Test can be
applied requires a certain amount of simple calculation plus an 
understanding of what is involved. It can most readily be explained 
by working in terms of both the above example and an idealized 
case at the same time. In both it is necessary to remember that the 
totals for both the columns and the lines in the tabulation are fixed 
by the observed conditions, as was also true in the simpler examples 
considered earlier. The idealized case which will be used is set out 
below. The columns are shown by a, b, c etc. and their totals by
x, y, z, while the lines are indicated by A, B etc. and their totals by
Y, Z. The overall total of items in the table is given by N.


            a      b      c      Totals
A                                  Y
B                                  Z
Totals      x      y      z        N


The first question to be asked is ‘what is the probability of values
occurring in Line A (or Private Estates)?’ Clearly from the specific
example it is 90/300, i.e. the total of the line divided by the overall total
number of occurrences. In the idealized case this would be Y/N. The
second question is then ‘what is the probability of values occurring
in Column a (or >20% woodland)?’ Equally obviously from the
specific example this is 60/300, i.e. the total of the column divided by
the overall total number of occurrences. In the idealized case this
would be x/N. From these two questions arises the third, i.e. ‘what is
the probability of values occurring in both Line A and Column a, i.e.
of falling in Square Aa?’ This is obtained by multiplying together
the two probabilities set out above, by applying the ‘Multiplication
Law’ mentioned on p. 62. Thus in the case of the specific example,
the probability of a holding falling into the category of a private
estate with more than 20% of the land under woodland would be

$$\frac{90}{300} \times \frac{60}{300} = \frac{5{,}400}{90{,}000} = 0.06$$

In the case of the idealized example this would be

$$\frac{Y}{N} \times \frac{x}{N} = \frac{Yx}{N^2}$$

If these values (0.06 and Yx/N²) give the probability of an occurrence
falling in Square Aa, then the actual frequency or number of occurrences
can be obtained by multiplying this probability by the overall
total of occurrences, i.e. in the specific case

$$\frac{6}{100} \times 300 = 18$$

and in the idealized case

$$\frac{Yx}{N^2} \times N = \frac{Yx}{N}$$

In other words, the expected frequency in any given square is simply
obtained by multiplying the Column Total by the Line Total, and
dividing this by the Overall Total of Occurrences. Thus in the
idealized example the expected frequency in Square Bc would be
Zz/N and in Square Ac, Yz/N.

From this very necessary digression to explain the mechanism by 
which the expected values can be calculated, it is now time to return 
to the calculations in the specific example introduced above. The 
expected frequencies for this example are calculated as follows: 


                          % of holding under woodland
                      >20%             10-20%            <10%          Totals
Private estates   (90 × 60)/300    (90 × 150)/300    (90 × 90)/300       90
Tenant farms      (210 × 60)/300   (210 × 150)/300   (210 × 90)/300     210
Totals                 60               150               90            300

Private estates        18                45               27             90
Tenant farms           42               105               63            210
Totals                 60               150               90            300

With both observed (p. 159) and expected values thus obtained it is
possible to calculate χ² by the same formula as before, i.e.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

$$= \frac{(30 - 18)^2}{18} + \frac{(45 - 45)^2}{45} + \frac{(15 - 27)^2}{27} + \frac{(30 - 42)^2}{42} + \frac{(105 - 105)^2}{105} + \frac{(75 - 63)^2}{63}$$

$$= \frac{144}{18} + \frac{0}{45} + \frac{144}{27} + \frac{144}{42} + \frac{0}{105} + \frac{144}{63}$$

$$= 8.0 + 5.33 + 3.43 + 2.29$$

$$= 19.05$$
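
The rule of ‘line total × column total ÷ overall total’ lends itself to a short sketch. The function name is hypothetical; the observed values are those of the woodland example.

```python
def expected_frequencies(table):
    """Expected value in each square = line total x column total / overall total."""
    line_totals = [sum(line) for line in table]
    column_totals = [sum(column) for column in zip(*table)]
    overall_total = sum(line_totals)
    return [[lt * ct / overall_total for ct in column_totals]
            for lt in line_totals]


# Columns: >20%, 10-20%, <10% of holding under woodland.
observed = [
    [30, 45, 15],    # private estates
    [30, 105, 75],   # tenant farms
]
expected = expected_frequencies(observed)

# Contribution of each square to chi-squared, laid out like the table.
contributions = [[(o - e) ** 2 / e for o, e in zip(obs_line, exp_line)]
                 for obs_line, exp_line in zip(observed, expected)]
chi2 = sum(sum(line) for line in contributions)

print(expected)
print([[round(c, 2) for c in line] for line in contributions])
print(round(chi2, 2))
```

Setting the contributions out square by square shows at once that the 10-20% column adds nothing to χ², the whole of the difference lying in the two extreme groups.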


When referring this value to the χ² Graph the degrees of freedom
are also required. These must be obtained both for the columns and
the lines, employing the usual method of (n − 1) in each case. Thus
if n₁ is the number of items along each line then the degrees of
freedom will be (n₁ − 1). Equally, if n₂ is the number of items in each
column then the degrees of freedom will be (n₂ − 1), while the overall
degrees of freedom will be the product of these two values, i.e.
(n₁ − 1)(n₂ − 1). In the present case these two values are 3 − 1 = 2
and 2 − 1 = 1 respectively, so that the product, i.e. the overall
degrees of freedom, is 2 × 1 = 2. The reader may check this value
by inserting any value into two of the six squares ensuring that they
are not both in the same column. With the totals remaining constant
it will be found that the other four values must automatically follow.
If now a χ² value of 19.05 is read off against 2 degrees of freedom on
Fig. 28 it will be seen that the relevant probability value is less than
0.1%. This means that the probability that the observed differences
could occur by chance is less than 0.1%, and it is therefore virtually
certain that they are not chance occurrences. Rather it can be said 
that there is a highly significant difference between private estates 
and tenant farms in terms of the proportion of their holding that is 
likely to be under woodland. 
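
The check on the degrees of freedom suggested above can be sketched as follows. The function is hypothetical and is written only for a table of two lines and three columns, with the two freely chosen squares taken in the first two columns of line A.

```python
def complete_table(line_totals, column_totals, a_first, a_second):
    """Fill a 2 x 3 table from its totals and two chosen squares of line A.

    a_first and a_second are the freely chosen values for the first two
    columns of line A; every other square then follows from the totals.
    """
    total_a, total_b = line_totals
    col_1, col_2, col_3 = column_totals
    if total_a + total_b != col_1 + col_2 + col_3:
        raise ValueError("line totals and column totals disagree")
    a_third = total_a - a_first - a_second
    line_a = [a_first, a_second, a_third]
    # Line B is whatever remains of each column total.
    line_b = [col_1 - a_first, col_2 - a_second, col_3 - a_third]
    return [line_a, line_b]


totals = {"line_totals": (90, 210), "column_totals": (60, 150, 90)}

# Inserting the observed values 30 and 45 recovers the observed table.
print(complete_table(**totals, a_first=30, a_second=45))
# Inserting 18 and 45 recovers the table of expected values.
print(complete_table(**totals, a_first=18, a_second=45))
```

Only two squares can be chosen freely; the remaining four are fixed by the totals, which is what is meant by 2 degrees of freedom.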

In considering this example the thought may well have occurred 
that if the actual amount of woodland on each holding had been 
known, then the tests of significance outlined in the previous two 
chapters could have been applied. While this is true, the χ² Test has
several points to recommend it. First, the data may only be available
in a grouped form and actual values not known, as was specified in
this example at the outset. Second, such tests as Student’s t make
assessments on the basis of two parameters, arithmetic average and
standard deviation. χ², in contrast, compares the whole frequency
distribution, especially if a large number of groups are used. This is
very important when the frequency distributions of the sets of data
being compared are not normal, for all formulae based on averages
and standard deviations assume a normal or near-normal distribution
curve. In the present example, the data for private estates are negatively
skew, and those for tenant farms are positively skew, while the overall
distribution is but slightly skew in a positive sense. For these reasons,
the χ² Test is to be preferred in this and similar cases. Cases of this
sort are illustrated by the following two examples, which again
involve the comparison of two sets of variables.

One sort of problem that can be studied in this way is represented 
by the following set of conditions. Across an upland area a study is 
being made of the depth of the peat layer and the angle of slope of 
the land where such depths are measured. A simple inspection sug- 
gests that peat is markedly deeper where the angle of slope is less than 
5° than where it is more than 5°. Actual values of recordings under 
these two sets of slope conditions vary within certain limits, as is shown 
below where the observations are grouped according to peat depth. 


No. of borings with a given peat depth


              >6 ft.   3-6 ft.   <3 ft.   Totals
<5° slope       10       20         6       36
>5° slope        3       30        21       54
Totals          13       50        27       90


Once again these three distributions are each skew, and not all in the 
same direction. Furthermore, the data are not in terms of absolute 
values but in groups or categories. When trying to test whether a 
significant difference does occur between slopes of less than, and 
more than, 5°, the χ² Test is the best to employ. The null hypothesis
in this case is that there is no difference between conditions on 
the two different sets of slopes, and expected values can there- 
fore be simply obtained on this basis by the method used in the 
last example. In other words, the expected value at any place is 
obtained by multiplying the ‘line total’ by the ‘column total’ and
then dividing by the ‘overall total’. The resultant values are set
out below. 


Expected values of
no. of borings with a given peat depth


              >6 ft.   3-6 ft.   <3 ft.   Totals
<5° slope      5.2       20       10.8      36
>5° slope      7.8       30       16.2      54
Totals          13       50        27       90


The calculation of χ² thus becomes

\[ \chi^2 = \frac{4.8^2}{5.2} + \frac{4.8^2}{10.8} + \frac{4.8^2}{7.8} + \frac{4.8^2}{16.2} = 4.43 + 2.13 + 2.95 + 1.42 = 10.93 \]

The two cells of the 3-6 ft. column contribute nothing, since there the
observed and expected values are equal.
As for the degrees of freedom, these are (3 − 1)(2 − 1) = 2 × 1 = 2.
By reference to Fig. 28 it can be seen that the probability of these
values occurring by chance is between 0.1% and 1.0%. With this
small probability of a chance occurrence it can be said that the
observed differences are almost certainly not due to chance and that
instead they represent a significant difference in terms of peat
accumulation between slopes of less than, and more than, 5°.

As a final example a different theme can be considered. In analysing
the industrial character of two large towns, the question of the
size of industrial establishment may be considered, and one way of
doing this is to use the numbers of people employed by each firm.
Frequently, however, it is not possible to obtain precise numbers in
such studies, though it is usually possible to allocate each firm to a
category which consists of a range of values. Thus in the present
example it is possible to allocate firms to the following four groups:
those that employ 2,000 or more people, 500 to 1,999 people, 100 to
499 people, and less than 100 people respectively. Having done this,
with the observed values set out below, it is necessary to test whether
any significant difference exists between these two towns in terms of
the size of industrial firms.


No. of firms employing given
numbers of people


            2,000+   500-1,999   100-499   <100   Totals
Town A        10        250        350       50      660
Town B         8        240        400       72      720
Totals        18        490        750      122    1,380

The null hypothesis in this case would be that there is no significant
difference between the two towns, and the expected values can there- 
fore be allocated in proportion to the various totals given above, as 
has been done in the last two examples. The expected values for these 
conditions are given below, and can be checked by the reader (for 
method see p. 161). 


Expected no. of firms employing
given numbers of people


            2,000+   500-1,999   100-499   <100   Totals
Town A        8.6       234        359      58.4     660
Town B        9.4       256        391      63.6     720
Totals        18        490        750      122    1,380


From these values χ² can be calculated, being in this case

0.23 + 1.09 + 0.23 + 1.21 + 0.21 + 1.00 + 0.21 + 1.11 = 5.29

Degrees of freedom are here 3, by the same means of counting as in
earlier examples. Reference to Fig. 28 indicates that the observed 
conditions could have occurred by chance with a probability of more 
than 10%. This being so, it is unjustified to postulate a difference of 
any significance between these towns in terms of size of industrial 
firms. 
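
The working of the last two examples can be retraced with a short
Python sketch. It is offered as an illustration only and is not part of
the method as set out above: the names `chi_square` and
`tail_probability` are hypothetical, and the probability is obtained from
the standard closed-form series for whole-number degrees of freedom
instead of being read from Fig. 28.

```python
import math


def chi_square(observed):
    """Return (chi-square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    overall = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(observed, row_totals):
        for obs, col_total in zip(row, col_totals):
            # Expected value = line total x column total / overall total.
            exp = row_total * col_total / overall
            chi2 += (obs - exp) ** 2 / exp
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


def tail_probability(chi2, dof):
    """Chance of a chi-square value at least this large (whole-number dof >= 1)."""
    half = chi2 / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd dof: erfc term plus a short series in chi2.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * chi2 / math.pi) * math.exp(-half)
    for i in range((dof - 1) // 2):
        total += term
        term *= chi2 / (2 * i + 3)
    return total


peat = [[10, 20, 6], [3, 30, 21]]                  # rows: <5 deg, >5 deg slope
towns = [[10, 250, 350, 50], [8, 240, 400, 72]]    # rows: Town A, Town B

for name, table in (("peat", peat), ("towns", towns)):
    value, dof = chi_square(table)
    p = tail_probability(value, dof)
    print(f"{name}: chi-square = {value:.2f}, d.f. = {dof}, probability = {100 * p:.2f}%")
```

Because the sketch does not round the expected values, its results differ
slightly from the hand calculations (about 10.94 for the peat borings,
and about 5.1 for the towns against 5.29 in the text), but the
conclusions are the same: a probability between 0.1% and 1.0% in the
first case and more than 10% in the second.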

It can thus be seen that many and varied problems of a geographical
nature can be analysed by means of the χ² Test, which enables an
objective assessment to be made. Furthermore it is often of great 
value in connection with data obtained from mapped distributions, 
where specific quantitative values are not available. Provided that the 
frequency with which conditions fall into specified categories can be 
established, then this test can be applied. The examples used have all 
been of a fairly simple and straightforward character, but more 
complex problems could be analysed by the use of this method. It will 
be found, however, that the working principles follow those outlined 
here. The more complex the problem, on the other hand, the more 
important it is that the terms of the null hypothesis be framed with 
care and the interpretation of results be done intelligently. In all 
cases, simple or complex, it must be remembered that the test is 
applied to frequencies, not to absolute values. One final point, which 
has been observed without comment in the above examples, is that
the χ² Test does not really work if the expected frequency in any cell
is less than 5. If this is found to occur, then two or more cells must be 
grouped together until this expected value of 5 is obtained. 
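
The grouping rule just stated can be sketched in Python. This is an
illustration under stated assumptions: the helper names are
hypothetical, and the four-class table is an invented variant of the
peat data in which the deepest class has been split in two, so that one
column has expected values below 5.

```python
def expected_counts(observed):
    """Expected values: line total x column total / overall total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    overall = sum(row_totals)
    return [[r * c / overall for c in col_totals] for r in row_totals]


def smallest_expected(observed):
    """Smallest expected frequency found in any cell of the table."""
    return min(min(row) for row in expected_counts(observed))


def merge_adjacent_columns(observed, left):
    """Group column `left` with the column immediately to its right."""
    return [row[:left] + [row[left] + row[left + 1]] + row[left + 2:]
            for row in observed]


# Hypothetical table: the ">6 ft." class split into two sparse classes.
split = [[2, 8, 20, 6],
         [1, 2, 30, 21]]

table = split
while smallest_expected(table) < 5 and len(table[0]) > 2:
    # Find the column holding the smallest expected value and group it
    # with a neighbouring column.
    expected = expected_counts(table)
    col = min(range(len(table[0])), key=lambda j: min(row[j] for row in expected))
    table = merge_adjacent_columns(table, min(col, len(table[0]) - 2))

print(table, smallest_expected(table))
```

In this invented case a single grouping restores the three-class peat
table used earlier, in which every expected value is at least 5.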
Provided that the necessary care is taken, along the lines indicated 
above, this test can prove of considerable value in geographical work. 
Whether it, or some other test of those already outlined, be used in 
any particular case, however, will depend on the nature of the data, 
the degree of accuracy required, the purpose for which the analysis 
is being made, and sometimes simply on the matter of preference. In 
the long run, it is only experience which will provide a sound guide 
when making such a choice between the various possible methods of 
testing the significance of the differences between several sets of data.


CHAPTER 11


THE PROBLEM OF CORRELATION 


In the previous three chapters methods have been presented by which 
the differences between sets of data can be tested as regards their 
statistical significance. The aim in all these cases was to assess 
whether the differences that were observed could have occurred by 
chance sufficiently frequently for some doubt to be cast on the 
validity of the apparent differences, or whether the probability of 
their having happened by chance was so slight that these observed 
differences could legitimately be accepted as justified and significant. 
In all these cases it was the overall characteristics of the various sets of 
data that were under consideration rather than the detailed character- 
istics and their changes. In many problems, however, there is the 
need to compare sets of data in terms of the extent to which a change 
in one is or is not reflected by a change in the other set. This neces- 
sarily implies that the individual items of the two sets of data co- 
exist either in time or space, such that the possibility of interrelated 
changes can be considered. In such a problem an index is required 
that reflects the degree to which changes in direction (+ or —) and 
magnitude in one set of data are associated with comparable changes 
in the other set. An index of this sort is provided by what is termed 
the Product Moment Correlation Coefficient, and in the following 
pages this coefficient will be employed in the study of several 
problems. 


Calculation of the Product Moment 
Correlation Coefficient 


A simple case may be provided by comparing ten years of cereal 
yields for two districts of a country, as are set out below.


                                 Years
Districts    1    2    3    4    5    6    7    8    9   10   Average
a.          25   26   34   25   24   28   27   29   28   29    27.5
b.          21   17   35   21   19   22   26   22   26   26    23.5


Cereal yields in bushels per acre

It will be seen that production varies from year to year in each district, that
these variations are not always the same for the two districts, and 
that the average yields around which yearly values vary are them- 
selves different. To try to assess from this by simple inspection the 
extent to which fluctuations in the yields of one district are reflected 
in those of the other district would lead to no more than a generalized 
impression, and in more complex situations even that would not be 
possible. 

The fact that these fluctuations take place about different mean 
values increases the problem of comparison, but this can be eliminated 
by calculating the amount and direction by which each item differs 
from its respective average. The results are set out in Table XX under
the headings (a − ā) and (b − b̄). This tabulation does give a clearer
picture. It can be seen, for example, that in seven of the ten years 
the two sets of data differ from their respective means in the same 
direction, though rarely by the same amount; in the other three years 
one district has above-average yields while the other has yields below 
the average. 


Table XX 


Tabulation of data for calculating the product moment correlation
coefficient


    a         b      (a − ā)   (b − b̄)     (a − ā)(b − b̄)
                                             +ve       −ve
   25        21       −2.5      −2.5         6.25
   26        17       −1.5      −6.5         9.75
   34        35       +6.5     +11.5        74.75
   25        21       −2.5      −2.5         6.25
   24        19       −3.5      −4.5        15.75
   28        22       +0.5      −1.5                   0.75
   27        26       −0.5      +2.5                   1.25
   29        22       +1.5      −1.5                   2.25
   28        26       +0.5      +2.5         1.25
   29        26       +1.5      +2.5         3.75
ā = 27.5  b̄ = 23.5                         117.75      4.25
                                              = +113.5


To find from these data a value which, for any one year, will
express the combined variation from the mean, the simplest method
is to multiply the two separate deviations together, i.e. (a − ā)(b − b̄).
This product gives positive values in some cases and negative
in others, and these are entered separately in the tabulation
(Table XX). Thus under the positive values of (a − ā)(b − b̄) fall all
those years in which the deviation is in the same direction in the two
districts, whether this be above or below the average, while the
negative values of (a − ā)(b − b̄) are for those years in which the
two deviations are in opposed directions. This is a basic reason for
multiplying these deviations rather than summing them. If then the
separate values under (a − ā)(b − b̄) are summed, and the total of
the negative values subtracted from the total of the positive ones,
then the total deviation is obtained. In the example being considered,
this can be seen from Table XX to be +113.5. If this is now divided
by the number of pairs of values being compared, then the average
deviation is obtained.

Thus,

\[ \frac{1}{n}\sum (a - \bar{a})(b - \bar{b}) = \frac{+113.5}{10} = +11.35 \]

This average of the products of the deviations of the two sets of
data from their respective means is based on actual changes, and is
known as the co-variance of these sets of data. Thus, whereas when
finding the variance of one set of data the mean is obtained of the
sum of the deviations squared, in this case the mean is obtained of the
sum of the product of two deviations. This measure of the relationship
between conditions as they occur can then be compared to the overall
deviations about the mean divorced from the time-scale itself. For
any one set of data this is represented by the standard deviation.
Here, with two sets of data, the same process is carried out as in the
calculation of the co-variance, i.e. instead of converting the standard
deviation to the variance for the one set of data by squaring it, the
two standard deviations are multiplied together. The co-variance is
expressed as a proportion of this value, thus giving the product
moment correlation coefficient,

\[ \text{i.e. the correlation coefficient } (r) = \frac{\frac{1}{n}\sum (a - \bar{a})(b - \bar{b})}{\sigma_a \cdot \sigma_b} \]

The possible values of this coefficient lie between +1 and −1, the
former indicating that the two sets of data vary in the same direction
and by the same amount on all occasions, while the latter indicates
that although the amount of variation is always the same the direction
of that variation is always opposed. Thus if the means for the
two sets of data were the same, as also were the standard deviations,
i.e. ā = b̄ and σ_a = σ_b, while a perfect correlation existed all the way,
then the following transpositions in the formula could be made

\[ r = \frac{\frac{1}{n}\sum (a - \bar{a})(b - \bar{b})}{\sigma_a \cdot \sigma_b} = \frac{\frac{1}{n}\sum (a - \bar{a})(a - \bar{a})}{\sigma_a \cdot \sigma_a} = \frac{\frac{1}{n}\sum (a - \bar{a})^2}{\sigma_a^2} \]

The top line of this is the expression for the variance, i.e. σ_a², so that

\[ r = \frac{\sigma_a^2}{\sigma_a^2} = +1 \]


Equally, if a perfect inverse relationship existed a similar calculation
would yield r = −1. In nearly all cases, however, actual values for r
will lie within these limits. 

Thus, in the example of crop yields begun above, the correlation 
coefficient would be obtained as follows:

\[ r = \frac{\frac{1}{n}\sum (a - \bar{a})(b - \bar{b})}{\sigma_a \cdot \sigma_b} = \frac{+11.35}{2.73 \times 4.8} = +0.87 \]

The co-variance value has already been calculated, while the reader
can check the standard deviation values by any of the methods outlined
in Chapter 3. This coefficient of +0.87 clearly implies that
there is a high degree of positive correlation between these two
districts in terms of fluctuations in cereal yields. Whenever values
increase in one district there is a distinct tendency for them to
increase also in the other, though this tendency is neither absolute
nor of uniform magnitude. In general terms, coefficients of between
+0.5 and +1 and between −0.5 and −1 are fairly significant, while
if values lie between −0.5 and +0.5 then little significant correlation
is to be expected. If a value of zero is obtained, this indicates that the
two sets of data show no linear association with each other and
that no correlation exists at all. To be safe in making any of these
interpretations, however, the statistical significance of the correlation
coefficient should always be tested by Student’s t Test, to assess
the probability of it having occurred by chance. This theme will be
taken up at greater length on p. 179. Furthermore, great care must 
always be taken in interpreting correlation coefficients. This value
of +0.87, for example, does not indicate why this relationship exists;
it does not prove that the same causes have produced the same 
results, for there may well have been different factors at work produc- 
ing these changes in the two areas. All it does is to indicate the degree 
of statistical relationship between the observed values—explanations 
must be sought by further work. 
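
The calculation set out in Table XX can be retraced with a few lines of
Python. This is a sketch for illustration only; the function name
`product_moment_r` is hypothetical, and the yields are those tabulated
above for the two districts.

```python
import math

district_a = [25, 26, 34, 25, 24, 28, 27, 29, 28, 29]
district_b = [21, 17, 35, 21, 19, 22, 26, 22, 26, 26]


def product_moment_r(a, b):
    """Return (r, co-variance) from deviations about the two means."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Co-variance: average product of the paired deviations.
    covariance = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n
    # Standard deviations with divisor n, as in the text.
    sd_a = math.sqrt(sum((p - mean_a) ** 2 for p in a) / n)
    sd_b = math.sqrt(sum((q - mean_b) ** 2 for q in b) / n)
    return covariance / (sd_a * sd_b), covariance


r, covariance = product_moment_r(district_a, district_b)
print(round(covariance, 2), round(r, 2))
```

The co-variance returned is 11.35 and the coefficient rounds to 0.87, in
agreement with the hand calculation.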


Alternative Method of Calculation 


Nevertheless, some indication as to whether there is likely to be 
some valid relationship for which an explanation needs to be found 
is itself a valuable aid and guide. It helps to prevent explanations 
being put forward for relationships that are more apparent than real, 
and also indicates what is likely to be the most fruitful line of 
further research. As with all these statistical methods, it is a means 
to an end, not an end in itself. This being so, it is desirable to keep to 
a minimum the labour involved in calculation for this coefficient. 
This can be effected by a method very similar to that adopted for the 
standard deviation on pp. 26-29, for as has just been indicated the 
various components of the formula for the coefficient have much in 
common with the standard deviation and variance values. 

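It was shown on p. 26 that the formula for the standard deviation

\[ \sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \]

could be rewritten as

\[ \sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} \]

Equally the co-variance element in the calculation of the correlation
coefficient can be altered from

\[ \frac{\sum (a - \bar{a})(b - \bar{b})}{n} \quad \text{to} \quad \frac{\sum ab}{n} - \bar{a} \cdot \bar{b} \]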

This enables more rapid calculation, especially if the two standard 
deviations are also calculated by the shorter method. On the other
hand, this approach can lead to very large numbers being involved,
and it is therefore desirable to adopt yet a further short cut. This is
done by adjusting all the values in the two sets of data, by subtracting
from them some number which approximates to the average of the
set of data. It does not matter if this number is different for the two
sets of data. These adjusted values can then be referred to as (x) and
(y) instead of (a) and (b) as previously. Then the amended formula
can be written as

\[ r = \frac{\frac{\sum xy}{n} - \bar{x} \cdot \bar{y}}{\sigma_x \cdot \sigma_y} \]

The effects of this alteration of the formula and adjustment of the 
data can best be appreciated if the example of cereal yields just used 
is re-worked by this method. In Table XXI the cereal yields for 
District (a) have been adjusted by subtracting 27 from each of them, 
and these are entered under column (x). Equally, the values for 
District (b) have had 24 subtracted from each of them, and are tabu- 
lated under (y). With these simple values a certain amount of pre- 
liminary calculation must be done. Firstly the (x) and (y) columns 
must be multiplied together and entered under (xy). Then each of 
columns (x) and (y) must have each value squared, i.e. to give
columns (x²) and (y²), followed by the totalling of each of these five
columns. From this stage the calculation of the coefficient can
proceed.


Table XXI


Tabulation of data for calculating the product moment correlation
coefficient by the shorter method


     x           y            xy          x²      y²
  (a − 27)    (b − 24)      +     −
    −2          −3          6              4       9
    −1          −7          7              1      49
    +7         +11         77             49     121
    −2          −3          6              4       9
    −3          −5         15              9      25
    +1          −2                2        1       4
     0          +2          0              0       4
    +2          −2                4        4       4
    +1          +2          2              1       4
    +2          +2          4              4       4
    +5          −5           +111         77     233
    Σx          Σy            Σxy         Σx²    Σy²

First the averages of the adjusted values, i.e. (x̄) and (ȳ), are calculated
by dividing the summation of each of these two sets of values
by the number of pairs of items involved. Thus:

\[ \bar{x} = \frac{\sum x}{n} = \frac{+5}{10} = +0.5 \qquad \bar{y} = \frac{\sum y}{n} = \frac{-5}{10} = -0.5 \]

From these, one component of the co-variance can be obtained, i.e.
x̄·ȳ, which here is +0.5 × −0.5 = −0.25.
Following this the other component of the co-variance can be
obtained.

\[ \frac{\sum xy}{n} = \frac{+111}{10} = +11.1 \]


Also from this retabulation the two standard deviations can be 
obtained again by the shortened formula from p. 26. Thus in the 
present example the following would apply: 


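\[ \sigma_x = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} = \sqrt{\frac{77}{10} - 0.25} = \sqrt{7.7 - 0.25} = \sqrt{7.45} = 2.73 \]

\[ \sigma_y = \sqrt{\frac{\sum y^2}{n} - \bar{y}^2} = \sqrt{\frac{233}{10} - 0.25} = \sqrt{23.3 - 0.25} = \sqrt{23.05} = 4.8 \]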


In this way, by dealing only in very small numbers and without 
complicated calculations, the various components of the correlation 
coefficient formula can be obtained. These can then be entered into 
the formula, in the following way: 


\[ r = \frac{\frac{\sum xy}{n} - \bar{x} \cdot \bar{y}}{\sigma_x \cdot \sigma_y} = \frac{11.1 - (-0.25)}{2.73 \times 4.8} = \frac{11.1 + 0.25}{13.1} = \frac{11.35}{13.1} = +0.87 \]

As can be seen, this gives exactly the same answer (r = +0.87) as
was obtained by the earlier method on p. 170. It is not simply an 
approximation that is obtained here, but a legitimate short cut in the 
process of calculation. Even with this simple example quite a con- 
siderable saving in time is effected, together with a decreased risk of 
calculation errors, and these advantages are increased the more 
complicated the data involved. Before considering a more complex 
problem, it will be as well to use this method again in the context 
of another simple set of data. 
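
That the short cut is exact, and not an approximation, can be checked
with a brief Python sketch. It is an illustration only; the name
`shortcut_r` is hypothetical, and the second pair of origins (20 and 30)
is chosen arbitrarily to show that the choice does not affect the result.

```python
import math

district_a = [25, 26, 34, 25, 24, 28, 27, 29, 28, 29]
district_b = [21, 17, 35, 21, 19, 22, 26, 22, 26, 26]


def shortcut_r(a, b, origin_a, origin_b):
    """Correlation coefficient by the shorter method with assumed origins."""
    x = [value - origin_a for value in a]
    y = [value - origin_b for value in b]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variance as (sum of xy)/n minus the product of the two means.
    covariance = sum(p * q for p, q in zip(x, y)) / n - mean_x * mean_y
    sd_x = math.sqrt(sum(p * p for p in x) / n - mean_x ** 2)
    sd_y = math.sqrt(sum(q * q for q in y) / n - mean_y ** 2)
    return covariance / (sd_x * sd_y)


r_table = shortcut_r(district_a, district_b, 27, 24)   # origins of Table XXI
r_other = shortcut_r(district_a, district_b, 20, 30)   # arbitrary origins
print(round(r_table, 2), round(r_other, 2))
```

Both calls give the same coefficient, since subtracting a constant from
every value alters neither the co-variance nor the standard deviations.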


Further Specific Examples 


Suppose, for example, that a study were to be made of the change 
in the yields of a given crop with increase in the altitude at which it is 
grown. From the tabulated values set out below it can be seen that an
inverse relationship between altitude and crop yield exists in this particular case. It
may be of value, however, to express the degree of this relationship 
in some more specific terms, so that it may be compared perhaps with 
the coefficient for other crops, and the product moment correlation 
coefficient would do this. 


  Altitude        Yield
  in feet      in bushels
                per acre

     100           30
     200           30
     500           31
     700           24
     800           26
   1,000           23
   1,400           13
   1,500           17
   1,800           14
   2,000           12

In this example the standard method of calculation would be a fairly 
easy process, but the modified method outlined above will be re- 
applied here. The reader may check the accuracy of the result by 
applying the longer method outlined on pp. 168-170. The method 
being used requires retabulation of the material. In the case of altitude,
all the values will be reduced by 900 ft., while for crop yields
the reduction will be by 20 bushels. The ensuing additions to the table
are set out below, while the necessary further calculations are also 
listed in succession. From these the correlation coefficient will be 
obtained. 


     x        y            xy                x²         y²
                       +         −
   −800     +10                8,000       640,000     100
   −700     +10                7,000       490,000     100
   −400     +11                4,400       160,000     121
   −200     + 4                  800        40,000      16
   −100     + 6                  600        10,000      36
   +100     + 3       300                   10,000       9
   +500     − 7                3,500       250,000      49
   +600     − 3                1,800       360,000       9
   +900     − 6                5,400       810,000      36
 +1,100     − 8                8,800     1,210,000      64
 +1,000     +20         −40,000          3,980,000     540
    Σx       Σy           Σxy               Σx²        Σy²

\[ \bar{x} = \frac{\sum x}{n} = \frac{+1{,}000}{10} = +100 \qquad \bar{y} = \frac{\sum y}{n} = \frac{+20}{10} = +2 \]

so x̄·ȳ = 100 × 2 = +200

\[ \frac{\sum xy}{n} = \frac{-40{,}000}{10} = -4{,}000 \]

\[ \sigma_x = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} = \sqrt{\frac{3{,}980{,}000}{10} - 100^2} = \sqrt{398{,}000 - 10{,}000} = 623 \]

\[ \sigma_y = \sqrt{\frac{\sum y^2}{n} - \bar{y}^2} = \sqrt{\frac{540}{10} - 2^2} = \sqrt{54 - 4} = 7.1 \]


From these values the correlation coefficient is obtained from the
formula

\[ r = \frac{\frac{\sum xy}{n} - \bar{x} \cdot \bar{y}}{\sigma_x \cdot \sigma_y} = \frac{-4{,}000 - 200}{623 \times 7.1} = \frac{-4{,}200}{4{,}423} = -0.95 \]

Thus the very high degree of inverse correlation, which might have
been expected anyway, receives quantitative confirmation. It is not 
for such purposes that these methods are really required, however, 
although such obvious problems do provide clear examples from 
which to work. It is rather when values are such that a rapid subjec- 
tive assessment cannot be made with any degree of reliability, that 
these calculations can be of value. One point that this past example 
does stress, however, is that this method can be used to assess the 
degree of correlation between a given phenomenon, i.e. crop yields, 
and a possible cause, i.e. change in altitude. Both this aspect of 
assessing possible causative relationships, and the application of the 
method to rather more difficult values, are reflected in the third 
example which now follows. 

Below are set out the data for annual (October to September) 
rainfall over, and run-off from, the River Etherow, for the period 
1937-1938 to 1952-1953. A causal relationship between the amount 
of rainfall over an area and the amount of run-off from that same 
area is to be expected, and it is frequently of value to be able to 
express the degree of relationship in numerical terms. For this 
purpose the product moment coefficient is of great value. 


Annual rainfall over, and run-off from, 
the River Etherow (Oct. 1937—Sept. 1953) 


Rainfall (in.)    Run-off (in.)
     (a)               (b)
    46.4              31.9
    63.0              46.8
    48.8              34.2
    60.1              47.5
    50.6              35.2
    57.5              40.5
    55.5              41.3
    57.0              43.5
    60.8              44.8
    48.3              38.5
    59.0              39.1
    41.0              26.5
    66.7              46.5
    56.4              43.4
    58.3              40.9
    55.7              41.3


Table XXII


Calculation of the product moment correlation coefficient between
rainfall and run-off


   Adjusted values for
   rainfall and run-off
     (x)           (y)             (xy)           (x²)      (y²)
  10(a − 55)    10(b − 40)      +         −
    − 86          − 81         6,966              7,396     6,561
    + 80          + 68         5,440              6,400     4,624
    − 62          − 58         3,596              3,844     3,364
    + 51          + 75         3,825              2,601     5,625
    − 44          − 48         2,112              1,936     2,304
    + 25          +  5           125                625        25
    +  5          + 13            65                 25       169
    + 20          + 35           700                400     1,225
    + 58          + 48         2,784              3,364     2,304
    − 67          − 15         1,005              4,489       225
    + 40          −  9                   360      1,600        81
    −140          −135        18,900             19,600    18,225
    +117          + 65         7,605             13,689     4,225
    + 14          + 34           476                196     1,156
    + 33          +  9           297              1,089        81
    +  7          + 13            91                 49       169
    + 51          + 19          +53,627          67,303    50,363
     Σx            Σy             Σxy              Σx²       Σy²

\[ \bar{x} = \frac{51}{16} = 3.19 \qquad \bar{y} = \frac{19}{16} = 1.19 \qquad \bar{x} \cdot \bar{y} = 3.8 \]

\[ \frac{\sum xy}{n} = \frac{53{,}627}{16} = +3{,}351.7 \]

\[ \sigma_x = \sqrt{\frac{67{,}303}{16} - 3.19^2} = \sqrt{4{,}206.44 - 10.18} = 64.8 \]

\[ \sigma_y = \sqrt{\frac{50{,}363}{16} - 1.19^2} = \sqrt{3{,}147.69 - 1.42} = 56.1 \]

When calculating this coefficient for the 16-year period for which
these details apply, it is more convenient to employ the shorter 
method outlined on pp. 172-173. The reader can again verify the 
accuracy of the result by working through the longer method given 
on pp. 168-170. The first stage in the calculation is to adjust the data 
by subtracting a suitable value from them. In the case of rainfall this 
could be 55 in., while for run-off 40 in. would be convenient. This 
has been done in Table XXII (p. 177), where the resultant values have 
also been multiplied by 10 to remove the decimal points and so 
facilitate later calculations. In this table the necessary further calculations
are also set out, as are the values for the various components
of the formula for the coefficient. The reader should follow these
stages through carefully.

[Figure 29. The correlation of annual rainfall in Moçambique with that at
Mossuril. Map legend, correlation coefficient with Mossuril:
+0.75 to +1.00; +0.50 to +0.75; +0.25 to +0.50; −0.25 to +0.25;
−0.50 to −0.25]

Despite the large numbers involved, the calculations here are 
simple ones and the ease with which the respective standard deviations
are obtained is a great saving of time. From the several values
defined in these calculations (Σxy/n; x̄·ȳ; σ_x and σ_y) the correlation
coefficient can be obtained, i.e.

\[ r = \frac{\frac{\sum xy}{n} - \bar{x} \cdot \bar{y}}{\sigma_x \cdot \sigma_y} = \frac{3{,}351.7 - 3.8}{64.8 \times 56.1} = \text{approximately } \frac{3{,}348}{3{,}635} = +0.92 \]


As was to be expected, this value indicates a very high degree of 
positive correlation between annual rainfall and annual run-off for 
this drainage basin. Yet a further value of studies such as this is 
that isopleth maps can be drawn based on correlation coefficients. 
Thus in Fig. 29, such coefficients have been calculated for annual
rainfall between Mossuril in Moçambique and all other climatological
stations in that territory. From the resultant values isopleths
are interpolated, thus providing a map showing the degree of
correlation of annual rainfall at Mossuril and that of the rest of
Moçambique.
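
The rainfall and run-off working of Table XXII can be retraced in
Python. The sketch below is illustrative only; `adjusted_r` is a
hypothetical name. It also checks a point used without comment in the
table, namely that multiplying the adjusted values by 10 does not alter
the coefficient.

```python
import math

rainfall = [46.4, 63.0, 48.8, 60.1, 50.6, 57.5, 55.5, 57.0,
            60.8, 48.3, 59.0, 41.0, 66.7, 56.4, 58.3, 55.7]
runoff = [31.9, 46.8, 34.2, 47.5, 35.2, 40.5, 41.3, 43.5,
          44.8, 38.5, 39.1, 26.5, 46.5, 43.4, 40.9, 41.3]


def adjusted_r(x, y):
    """Correlation coefficient from (sum xy)/n - mean_x*mean_y over sd_x*sd_y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum(p * q for p, q in zip(x, y)) / n - mean_x * mean_y
    sd_x = math.sqrt(sum(p * p for p in x) / n - mean_x ** 2)
    sd_y = math.sqrt(sum(q * q for q in y) / n - mean_y ** 2)
    return covariance / (sd_x * sd_y)


# Adjusted values as in Table XXII: subtract 55 in. and 40 in., then
# multiply by 10; round() removes floating-point noise from the tenths.
x = [round(10 * (value - 55)) for value in rainfall]
y = [round(10 * (value - 40)) for value in runoff]

r_adjusted = adjusted_r(x, y)
r_raw = adjusted_r(rainfall, runoff)
print(sum(x), sum(y), round(r_adjusted, 2), round(r_raw, 2))
```

The column totals agree with those of Table XXII, and the coefficient is
the same whether the adjusted or the original values are used.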


Correlation Significance Test 


The methods of calculating this coefficient have thus been illustrated 
at some length, and the kinds of problems to which the method can 
be applied have also been shown. In all such cases, however, there is 
always the possibility that the coefficient obtained could have 
occurred ‘by chance’, i.e. that its significance is suspect because of the
probability of a chance occurrence. Therefore the correlation coefficient
must be tested to see whether or not a chance occurrence
of this magnitude is likely, as a result of the size of the sample or
set of data analysed. This can be done by the use of the Student’s t
distribution, using the following formula:

\[ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \]

where n = the number of pairs of data studied, and where the
degrees of freedom are (n − 2).

In the first example in this chapter, where two sets of crop yields 
were compared, the necessary values were r = +0.87 and n = 10.
These can be introduced into the formula for Student’s t, with the
sign of the correlation coefficient always being taken as positive,
simply for the sake of convenience. So it is seen that

\[ t = \frac{0.87 \times \sqrt{10 - 2}}{\sqrt{1 - 0.87^2}} = \frac{0.87 \times \sqrt{8}}{0.493} = \frac{2.46}{0.493} = 4.99 \]


The degrees of freedom in this case are simply n − 2 = 10 − 2 = 8.
By referring this t value and the degrees of freedom to the Student’s
t graph in Fig. 27, it can be seen that the percentage probability that
this coefficient could have occurred by chance is only 0.1%. In other
words, this coefficient is highly significant. This is even more true in
the case of the other two examples which have been examined, a
statement which can readily be checked by the reader introducing
the following values into the formula for Student’s t:


(i) example of crop yields          r = −0.95    n = 10
    correlated with altitude

(ii) example of run-off             r = +0.92    n = 16
     correlated with rainfall
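
The check suggested to the reader can be sketched in Python as follows.
The function name `t_for_correlation` is hypothetical; the sketch only
evaluates the formula given above and leaves the reading of the
probability to the Student’s t graph.

```python
import math


def t_for_correlation(r, n):
    """Return (t, degrees of freedom) for a correlation coefficient r from n pairs."""
    dof = n - 2
    # The sign of r is ignored, as in the text.
    t = abs(r) * math.sqrt(dof) / math.sqrt(1 - r * r)
    return t, dof


examples = {
    "cereal yields, two districts": (0.87, 10),
    "crop yields and altitude": (-0.95, 10),
    "run-off and rainfall": (0.92, 16),
}

for name, (r, n) in examples.items():
    t, dof = t_for_correlation(r, n)
    print(f"{name}: t = {t:.2f} with {dof} degrees of freedom")
```

The first example returns t = 4.99 with 8 degrees of freedom, as worked
above; the other two give larger values of t.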


To save the calculation of these significance levels in every case, a 
graph has been prepared (Fig. 30) from which they can be read 
directly. Thus it will be seen that if only 10 pairs of items are compared,
giving but 8 degrees of freedom, then the correlation coefficient
must be either above +0.69 or below −0.69 before it can be
considered as statistically significant even at the 5% level. On the
other hand, if about 60 pairs of items are compared, then a coefficient
as low as + or −0.25 is statistically significant at this level.
If, however, a high degree of significance is required (i.e. at the 0.1%
level) then the coefficient values must be markedly higher, e.g. with
40 degrees of freedom r must be greater than +/−0.5, a value
which was suggested on p. 170 as being necessary as a general overall
guide. Moreover, this means that in Fig. 29, when n = 20 and the
degrees of freedom = 18, only those areas with a coefficient greater
than +0.6 have a significant correlation with Mossuril at the 1%
level.


Figure 30. Graph of significance levels for correlation coefficients using Student’s
t distribution


Spearman’s Rank Correlation Coefficient 


Even with the various means of shortening the calculations, this 
product moment correlation coefficient is not a value which is 
rapidly obtained. At times, therefore, it is convenient to use a rather 
different coefficient which is based not on actual values but rather 
on the relative rank of the values, i.e. where they occur in order of 
magnitude. Apart from this providing a quicker method of assessing 
correlation, there are many occasions when only such rankings are 
available and actual values are not known. The method employed 
in such cases is called Spearman’s Rank Correlation Coefficient. 

The sort of problem which may be considered in this way is shown 
by the following example. It may be known that for five industrial 
areas their relative orders of importance for (a) engineering in 
general and (b) car manufacture in particular are as listed below


                      Industrial areas
                (i)    (ii)    (iii)    (iv)    (v)
Engineering      1       2       3        4      5
Cars             3       2       1        5      4

As can be seen, these five areas do not fall in the same order (or rank)
for these two activities. On the other hand, the three more important 
and the two less important areas are the same in each case. It may 
therefore be useful to assess the degree of correlation there is between 
engineering in general and the manufacture of cars in particular. The 
first stage is to tabulate the data in terms of ‘rank’; then obtain the
difference between the two sets of data in each case (d), square these
differences (d²) and sum these squares (Σd²); this is set out below.


Engineering     Cars
  (rank)       (rank)      d      d²
     1            3        2       4
     2            2        0       0
     3            1        2       4
     4            5        1       1
     5            4        1       1
                               Σd² = 10


This value is then used in the following formula, in which n is the 
number of pairs of occurrences being considered: 
\[ R = 1 - \frac{6 \sum d^2}{n^3 - n} \]

In the above example this will give a value of

\[ R = 1 - \frac{6 \times 10}{5^3 - 5} = 1 - \frac{60}{125 - 5} = 1 - \frac{60}{120} = 1 - 0.5 \]

\[ R = +0.5 \]


This value suggests some relationship of a positive nature. If the 
significance of this value is checked from Fig. 30, however, the 
degrees of freedom being n —-2=5—2=3, then it will be 
appreciated that this value is not significant statistically at any of the 
given levels. 
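
The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in the text; the function name `spearman_rank_coefficient` is an assumption introduced here, and the inputs are assumed to be ranks already, not raw values.

```python
def spearman_rank_coefficient(ranks_a, ranks_b):
    """Return R = 1 - 6 * sum(d^2) / (n^3 - n) for two lists of ranks."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both rankings must contain the same number of items")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("at least two pairs of ranks are needed")
    # d is the difference between the two ranks of each item
    sum_d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d_squared) / (n ** 3 - n)


# Ranks of the five industrial areas in the example above
engineering = [1, 2, 3, 4, 5]
cars = [3, 2, 1, 5, 4]
print(spearman_rank_coefficient(engineering, cars))  # 0.5
```

Passing two identical rankings returns +1, and passing one ranking together with its exact reverse returns −1, which are the limits of the coefficient.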

The limits of this coefficient are again +1 and −1. Thus if the
degree of correlation were to be perfect and positive, with the ranking 
the same in each group, then the values for d in the above calculation
would all be 0, as would therefore be both $d^2$ and $\sum d^2$. As a result,
the value $\frac{6 \sum d^2}{n^3 - n}$ would equal 0, so that R would equal 1 − 0 = +1.
If, on the other hand, the correlation were perfect but negative, the
following would be the case.


Set a     Set b      d      d²
  1         5        4      16
  2         4        2       4
  3         3        0       0
  4         2        2       4
  5         1        4      16
                   Σd² = 40

$$R = 1 - \frac{6 \sum d^2}{n^3 - n} = 1 - \frac{6 \times 40}{5^3 - 5} = 1 - \frac{240}{120} = 1 - 2 = -1$$


Thus the formula is designed to ensure that +1 and −1 are the
largest values that can be returned, so that in this way it is comparable 
to the product moment correlation coefficient. 


Comparison of the Two Coefficients 


There is also the question, however, as to whether or not it gives 
the same answer as does the more complicated method, or at least 
one which closely approximates to it. This can be tested by reworking 
the cereal yield data that were used in the first example in this 
chapter. The values, set out in full on p. 167, must be put in rank 
order, and if two are the same they are both given the same rank. 
This can be seen in the table below. 


District a    District b     d      d²
    6             4          2       4
    5             6          1       1
    1             1          0       0
    6             4          2       4
    7             5          2       4
    3             3          0       0
    4             2          2       4
    2             3          1       1
    3             2          1       1
    2             2          0       0
                           Σd² = 19

From this tabulation Spearman’s Rank Correlation Coefficient can
readily be calculated, i.e.

$$R = 1 - \frac{6 \sum d^2}{n^3 - n} = 1 - \frac{6 \times 19}{10^3 - 10} = 1 - \frac{114}{990}$$

$$= 1 - 0.115$$

$$= +0.885$$


This differs by only 0.015 from the value given by the longer methods
on pp. 167-173. Furthermore, on testing for significance from the
graph in Fig. 30 it can be seen that this value lies between the 1.0%
and 0.1% levels, i.e. it is statistically significant. With this ranking
method, however, this testing of significance can only satisfactorily
be carried out if n is not less than 10 (as is just the case here).
Moreover, as this coefficient is based only on rank and not on actual 
values, it must be used with care if real accuracy is required. The 
considerable shortening and simplification of the calculations 
involved, however, render it of great value especially for obtaining 
a generalized estimate of correlation, quite apart from the fact that 
in many cases only rank may be available for analysis. 
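
The ranking rule used in this comparison, in which equal values share one rank and the next value takes the next rank, can be sketched as follows. The helper names are assumptions made for illustration, the yields are hypothetical (they are not the values from p. 167), and it is assumed that rank 1 goes to the largest value.

```python
def shared_ranks(values):
    """Rank values from largest (rank 1) downwards; equal values share a rank."""
    distinct = sorted(set(values), reverse=True)
    rank_of = {value: position for position, value in enumerate(distinct, start=1)}
    return [rank_of[value] for value in values]


def rank_coefficient(sum_d_squared, n):
    """Spearman's coefficient from the sum of squared rank differences."""
    return 1 - (6 * sum_d_squared) / (n ** 3 - n)


# Hypothetical yields, for illustration only
yields = [31, 27, 27, 24, 22]
print(shared_ranks(yields))               # [1, 2, 2, 3, 4]

# Sum of squared differences and number of pairs from the table above
print(round(rank_coefficient(19, 10), 3))  # 0.885
```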

The earlier warning concerning care in interpretation must be 
reiterated here at the end of this chapter on correlation. Such 
methods as those outlined are meant to be useful tools. They do not 
exempt the geographer from the necessity to think in a logical and 
sensible manner. It may well be quite possible to obtain a high 
correlation coefficient of statistical significance between two sets of 
conditions which clearly have nothing to do with each other— 
perhaps, for example, between coal production in Britain and the 
number of penguins in Antarctica in the same years! No one would 
try to suggest that a causal relationship exists between these two 
despite any correlation coefficient that may be obtained. In other 
cases, however, it may be more difficult to decide whether or not 
statistical correlation implies causal relationships.


REGRESSION LINES AND CONFIDENCE LIMITS


In many studies the calculation of a correlation coefficient, in any of 
the ways outlined in the previous chapter, may be sufficient in itself. 
This may indicate the most profitable lines for further research, or 
provide the data from which maps of iso-correlation may be drawn. 
In other cases, however, it may be desirable to take the analysis a 
stage further by calculating the value that might be expected for one 
set of data if some given value occurs in the other set. This could be 
done by separate calculations each time, but it is more effective to 
draw on a graph the line that represents the relationship between the 
two sets of data. The requisite values can then be read off as required. 


Straight-line Regression for Two Variables 


In the case of a perfect positive correlation between two sets of 
data the individual values would be distributed as shown in Fig. 31a. 
They would all fall on a straight line, and this line could be drawn 
through the points without any further calculations. This, in effect,
is a functional relationship which allows of no minor deviations
from this straight line and which implies that any change of a given
magnitude in one set of data must necessarily be associated with an
exactly comparable change in the other set. Such a relationship is
rarely found in the problems which confront geographers. Instead
there is likely to be at best some sort of correlation, the degree of it
being reflected by the coefficients outlined earlier. With correlation
of this sort the distribution of the actual values on a graph will be
comparable to that shown in Fig. 31b. There is clearly some sort of
relationship, but it is neither regular nor clear-cut. The insertion of a
line from which can be ascertained the value of one variable when
the other variable is known is just not possible in this case, for there
is no one value which must occur at any given point on the graph.
Rather there are various possibilities, and the best that can be done
is to insert a line that will give the closest approximation to the
relationship at all stages.

Figure 31. Graphs illustrating differing degrees of correlation and relationship
(panels A and B, each plotted against Data X)

Such a line as this is known as a ‘regression line’. Unlike the 
situation when a functional relationship occurs, i.e. when r = +1 or 
−1 (Fig. 31a), it is not possible to insert a line by eye with any degree
of accuracy, for such a visual insertion could be no more than guess- 
work. In obtaining the regression line by calculation, the idea is to 
ensure that the sum of the squares of the differences of the individual 
observed values from the line is at an absolute minimum. This is 
known as the method of ‘least squares’. It may be visualized as being 
akin to ensuring that the variance of the individual values in relation 
to the regression line is the smallest value it can possibly be. Clearly, 
in the case of the functional relationship expressed by r = +1 (see 
Fig. 31a) there will be no deviations of actual values from the regres- 
sion line for it passes through all the points. In all other cases there 
will also be one position for the regression line that will ensure that 
the sum of the squares of the differences of the values from that line 
will be the lowest possible value. To find the position of this line by
trial and error would be both difficult and wasteful. It is therefore 
essential that some method be devised by which to calculate the loca- 
tion and slope of this line. 
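
The 'least squares' requirement described above can be written out compactly. The following is a sketch under stated assumptions: the symbols m (slope) and c (intercept) are introduced here for illustration only, and a is taken to be the variable estimated from b.

```latex
% Sum of the squared deviations of the observed a-values from a trial line a = m b + c
S(m, c) = \sum_{i=1}^{n} \bigl( a_i - (m\, b_i + c) \bigr)^2

% Setting both partial derivatives of S to zero gives the line for which S is smallest
m = \frac{\sum (b_i - \bar{b})(a_i - \bar{a})}{\sum (b_i - \bar{b})^2}
  = r \, \frac{\sigma_a}{\sigma_b},
\qquad
c = \bar{a} - m\, \bar{b}
```

Because c is fixed by the two averages, the least-squares line always passes through the point representing the average of each set of data.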

Theoretically it would be possible to calculate the minimum value 
for the sum of the squares by setting up the appropriate equation for 
each pair of values being considered. This is a lengthy procedure, 
however, and it is more convenient to apply a formula which gives 
the same result with much less labour. This formula requires not 
only the correlation coefficient but also the average and standard 
deviation values for the two sets of data. These have all been cal- 
culated for the correlation coefficient itself, though if they were 
obtained from ‘adjusted’ values they will need reconverting to actual 
values. Thus if the first example in Chapter 11 be reconsidered for
this purpose, it will be seen on p. 172 that the following conversions
had been made between the original a and b values and the adjusted
x and y values.

$$x = a - 27 \qquad y = b - 24$$

By rearranging these terms it can be seen that

$$a = x + 27 \qquad b = y + 24$$

so that

$$\bar{a} = \bar{x} + 27 = 0.5 + 27 = 27.5 \qquad \bar{b} = \bar{y} + 24 = -0.5 + 24 = 23.5$$

By referring to p. 168, where these values were fully calculated, it can
be seen that these are the right answers. Also, the standard deviations
obtained by the methods set out on p. 173 agree with the values
given on p. 170.

In attempting to construct a regression line for the cereal yield 
values for the two districts in the example on p. 167 the following 
values therefore obtain: 

$$\bar{a} = 27.5 \qquad \bar{b} = 23.5$$

$$\sigma_a = 2.73 \qquad \sigma_b = 4.80$$

$$r = +0.87$$


The formula to be used is written as follows:

$$a - \bar{a} = r \cdot \frac{\sigma_a}{\sigma_b} \cdot (b - \bar{b})$$

in which the value a is unknown and the value b is known. In other
words, the unknown value (a) differs from the average of its set of
data ($\bar{a}$) by the same amount as the known value (b) differs from its
average ($\bar{b}$), modified by (i) the ratio of the two standard deviations,
which express the overall spread of values about their respective
averages and (ii) the correlation coefficient, which expresses the
degree of actual relationship unit by unit.
In the present example this becomes

$$a - \bar{a} = r \cdot \frac{\sigma_a}{\sigma_b} \cdot (b - \bar{b})$$

$$a - 27.5 = 0.87 \times \frac{2.73}{4.80} \times (b - 23.5) = 0.495(b - 23.5)$$

$$a = 0.495b - 11.6 + 27.5 = 0.495b + 15.9$$

Thus the regression of a (unknown) upon b (known) is expressed by

$$a = 0.495b + 15.9$$

By inserting values for (b) into this equation, the appropriate values
for (a) can be obtained. Only two such values are required because
the regression line is a straight line. Thus if b = 23.5 then

$$a = 0.495b + 15.9 = (0.495 \times 23.5) + 15.9 = 11.6 + 15.9$$

$$a = 27.5$$

These values for a and b (27.5 and 23.5 respectively) are the average
values for the two sets of data. So in fact only one value really needs
calculating since the other one is provided by the two average
values. The second value in this case can be when b = 20, so that

$$a = 0.495b + 15.9 = (0.495 \times 20) + 15.9 = 9.9 + 15.9$$

$$a = 25.8$$
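
The regression of a on b can be reproduced with a small helper. This is a minimal sketch: the function name `regression_line` and its parameter names are assumptions made here, and the inputs are the averages, standard deviations and correlation coefficient quoted in the text.

```python
def regression_line(mean_unknown, mean_known, sd_unknown, sd_known, r):
    """Return (slope, intercept) of the line estimating the unknown variable.

    Based on: unknown - mean_unknown = r * (sd_unknown / sd_known) * (known - mean_known)
    """
    slope = r * sd_unknown / sd_known
    intercept = mean_unknown - slope * mean_known
    return slope, intercept


# District a (unknown) estimated from district b (known)
slope, intercept = regression_line(27.5, 23.5, 2.73, 4.80, 0.87)
estimate_at_20 = slope * 20 + intercept
print(round(slope, 3), round(intercept, 1), round(estimate_at_20, 1))  # 0.495 15.9 25.8
```

Substituting the average of b returns the average of a, which is why the two averages always supply one point on the line.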

From these two values (the calculated ones and the averages) it is
possible to draw the regression line of a on b, as has been done in
Fig. 32.

Figure 32. Regression lines for the relationship between cereal yields for two
districts (axes: District (a) against District (b), in bushels per acre)

From this line it is possible to make a reasonable assessment
of what the value for a will be for any given value of b. It is not
legitimate, however, to try to assess the value of b for any given
value of a from this same regression line. The formula is designed
to ensure that the lowest value is obtained for the sum of the squares
of the deviations of the a values from the line. The same line will
only also yield the lowest sum of the squares for the b values if
r = ±1. In all other cases, therefore, it is necessary to calculate
a separate regression line from which to assess b from a so that the
lowest sum of the squares of the deviations of the b values from the
line is obtained.
This is calculated by the same formula, such that

$$b - \bar{b} = r \cdot \frac{\sigma_b}{\sigma_a} \cdot (a - \bar{a})$$

In the present example this would give the following values:

$$b - 23.5 = 0.87 \times \frac{4.80}{2.73} \times (a - 27.5)$$

$$b - 23.5 = 1.53(a - 27.5)$$

$$b = 1.53a - 42.0 + 23.5$$

$$b = 1.53a - 18.5$$


As in the previous case, the insertion of values for a will yield the 
appropriate values for b. Again, however, the two average values 
(a = 27.5 and b = 23.5) give one of the points and only one value of
a need be inserted.


Thus if a = 25 then b = (1.53 × 25) − 18.5 = 38.2 − 18.5
b = 19.7


This regression line, from which b (unknown) can be assessed from a 
(known), has also been entered on Fig. 32. It can be seen that it 
differs from the one from which a values can be assessed. The angle 
of difference between these two regression lines reflects the relative 
size of the correlation coefficient. When it is ±1 then the two
lines coincide; when it is 0 then the two lines are at right angles to 
each other; all other values of r give lines which differ from each 
other between these extreme limits. 
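
The link between the two regression lines and the correlation coefficient can be checked numerically. The sketch below uses an assumed helper name, `regression_coefficient`, and the values quoted for the cereal yield example. It relies on the fact that the product of the two regression coefficients equals r squared, so the two lines can only coincide when r = ±1.

```python
def regression_coefficient(r, sd_unknown, sd_known):
    """Change in the unknown variable per unit change in the known one."""
    return r * sd_unknown / sd_known


mean_a, mean_b = 27.5, 23.5
sd_a, sd_b = 2.73, 4.80
r = 0.87

a_on_b = regression_coefficient(r, sd_a, sd_b)  # used to estimate a from b
b_on_a = regression_coefficient(r, sd_b, sd_a)  # used to estimate b from a

# Second line: b estimated from a, passing through the two averages
intercept_b_on_a = mean_b - b_on_a * mean_a
b_when_a_is_25 = b_on_a * 25 + intercept_b_on_a

print(round(b_on_a, 2), round(b_when_a_is_25, 1))    # 1.53 19.7
print(round(a_on_b * b_on_a, 4), round(r ** 2, 4))   # 0.7569 0.7569
```

Carried without intermediate rounding, the intercept of the second line comes out near −18.6 rather than the −18.5 of the text, a difference that arises only from rounding in the hand calculation.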

In the present example the correlation was positive, and as a result 
the regression lines rise from left to right. If the correlation were to 
be negative then the line would rise from right to left instead. This is 
shown in Fig. 33, which gives the regression line for cereal yields 
(unknown) on altitude (known) based on the data used in the second 
example in Chapter 11 (p. 174). The correlation coefficient for this
was −0.95, and the reader can check by the above formula the
accuracy of the expression for the regression line, i.e. b = 33 − 0.01a
(where a = altitude and b = cereal yields).

Figure 33. Regression line and confidence limits of yield (unknown) on altitude
(known) (axes: altitude (a) against yield (b), in bushels per acre)

In both these examples,
however, it must be stressed that these regression lines are only best
estimates of the relationship between the two variables; equally the
value for the unknown variable which this gives is only a best
estimate. No more than this can be obtained, for with an imperfect
relationship there cannot be one answer which must be right.


Standard Errors and Confidence Limits 


For this reason it is desirable to be able to calculate the standard 
error of such estimates, so that the range within which actual con- 
ditions are likely to fall can be assessed with some accuracy. This 
standard error of the estimate of the unknown value (e.g. of a) is 
expressed by the term (Sa) and is calculated by the formula 


$$S_a = \sigma_a \sqrt{1 - r^2}$$


With this value obtained, the arguments presented several times 
earlier are again applied, i.e. that there is a 95% probability that 
actual values will differ from the regression line value by not more 
than twice the standard error, and that the probability of values 
differing by more than this amount is only 5%. This means that by
the appropriate calculations confidence limits can be obtained in
relation to the estimated values indicated by the regression analysis. 

In terms of the first example in this chapter (p. 187) the standard 
error of the estimate can be found as follows. If it is the value of a
that is being estimated, then the standard error

$$S_a = \sigma_a \sqrt{1 - r^2}$$

i.e.

$$S_a = 2.73 \sqrt{1 - 0.87^2} = 2.73 \sqrt{1 - 0.76} = 2.73 \sqrt{0.24} = 2.73 \times 0.49 = 1.34$$

Then $2 S_a = 2.68$


This means that when b = 20 then a = 25.8 (see p. 188) ± 2.68,
with a 95% probability. In other words, there is a 95% probability
that the value of a will lie between 23.1 and 28.5. Although such an
answer to the query ‘what will yields be in “District a” when they
are 20 bushels per acre in “District b”?’ may not seem as precise as
saying bluntly 25.8 bushels per acre, it is more accurate and justified.
Furthermore it reflects the somewhat variable relationship which is
clearly apparent in Fig. 32. For this example it is equally possible
to calculate the standard error of the estimate for values of b, when
it is values of a that are known. In this case the calculations are as
follows:

$$S_b = \sigma_b \sqrt{1 - r^2} = 4.80 \sqrt{1 - 0.87^2} = 4.80 \times 0.49 = 2.35$$

Then $2 S_b = 4.70$

Thus when a = 25, then b = 19.7 (see p. 189) ± 4.7, i.e. b will lie
between 15.0 and 24.4 bushels per acre, with a 95% probability.
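
The standard error and the limits drawn from it can be sketched as below. The function names are assumptions introduced for illustration; the inputs are the standard deviation, correlation coefficient and estimate quoted in the text for district a.

```python
import math


def standard_error_of_estimate(sd_unknown, r):
    """Standard error of an estimate read from the regression line."""
    return sd_unknown * math.sqrt(1 - r ** 2)


def confidence_limits(estimate, standard_error, multiples=2):
    """Limits lying the given number of standard errors either side of the estimate."""
    margin = multiples * standard_error
    return estimate - margin, estimate + margin


se_a = standard_error_of_estimate(2.73, 0.87)
low, high = confidence_limits(25.8, se_a)  # two standard errors, i.e. about 95%
print(round(low, 1), round(high, 1))       # 23.1 28.5
```

Without the intermediate rounding used in the hand calculation the standard error comes out nearer 1.35 than 1.34, which leaves the limits unchanged to one decimal place.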
Quite apart from calculating such a standard error for any given 
assessment it is possible to construct lines on the same graph as the 
regression line which will enable the ‘confidence limits’ to be read 
off at a glance. The 95% confidence limits have been entered on 
Fig. 33 which shows the regression line for crop yields (unknown) 
on altitude (known), from the second example in this chapter. The 
regression line itself was expressed (p. 190) as 


$$b = 33 - 0.01a$$

while the standard error of this estimate of b becomes

$$S_b = \sigma_b \sqrt{1 - r^2} = 7.1 \sqrt{1 - (-0.95)^2} = 7.1 \times 0.312 = 2.22$$

$$2 S_b = 4.44$$

Therefore along each line of altitude points were placed, 4.44 bushels
above and below the regression line, and these values (one set above
the regression line and one set below it) were joined together to give
the 95% confidence limits. In this way it can be seen that at an
altitude of 500 ft. there is a 95% probability that cereal yields will
be between 23.56 and 32.44 bushels per acre. These are wide limits
but reliable ones. More restricted limits indicating the range of values
occurring with a 68% probability could be obtained by placing the
limits only one standard error from the regression line. On the other
hand, yet more stringent limits of 99.7% probability could be obtained
if three standard errors were to be used. 


Further Specific Example 


This whole series of calculations, by which a regression line and 
confidence limits are calculated for two variables between which a 
certain degree of correlation exists, will now be repeated for the 
third example used in Chapter 11, i.e. for the relationship between 
rainfall and run-off. In this way the repetition of the methods will 
help to reinforce the outline presented earlier. 

The correlation coefficient for this example was obtained by the 
shorter method presented in the last chapter (see Table XXII), and 
again it is necessary first to convert the adjusted values to the true 
values for certain parameters. The average values can be obtained as 
follows. It was shown in Table XXII that 
$$x = 10(a - 55); \quad \text{so } x = 10a - 550; \quad 10a = x + 550; \quad a = 55 + 0.1x$$

Thus $\bar{a} = 55 + 0.1\bar{x} = 55 + 0.319 = 55.32$

Again,

$$y = 10(b - 40); \quad \text{so } y = 10b - 400; \quad 10b = y + 400; \quad b = 40 + 0.1y$$

Thus $\bar{b} = 40 + 0.1\bar{y} = 40 + 0.119 = 40.12$

As for the standard deviations of a and b, these are obtained by
dividing the standard deviations of x and y respectively by 10 (i.e.
the amount by which values were multiplied to eliminate the decimal
points). Thus

$$\sigma_a = \frac{\sigma_x}{10} \qquad \sigma_b = \frac{\sigma_y}{10}$$

So the data from which the regression line may be calculated are

$$\bar{a} = 55.32 \qquad \sigma_a = 6.48 \qquad \bar{b} = 40.12 \qquad \sigma_b = 5.61 \qquad r = +0.92$$

These values must then be inserted into the formula for the
regression of b (run-off) on a (rainfall), i.e.

$$b - \bar{b} = r \cdot \frac{\sigma_b}{\sigma_a} \cdot (a - \bar{a})$$

$$b - 40.12 = 0.92 \times \frac{5.61}{6.48} \times (a - 55.32)$$

$$b = 0.80(a - 55.32) + 40.12 = 0.8a - 44.26 + 40.12$$

$$b = 0.8a - 4.14$$

For the two points required for the drawing of a regression line one is
provided by the two averages, i.e. when a = 55.32 then b = 40.12.
The other is obtained by substitution in the expression for b,
i.e. if a = 60, then

$$b = (0.8 \times 60) - 4.14 = 48 - 4.14 = 43.86$$
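
The conversion from adjusted to true values, and the regression that follows from it, can be sketched as below. The helper names `decode_mean` and `decode_sd` are assumptions made for illustration. The coded averages 3.19 and 1.19 follow from the 0.319 and 0.119 quoted in the text, and the coded standard deviation 64.8 is inferred from the quoted value of 6.48 and the factor of 10.

```python
def decode_mean(coded_mean, assumed_origin, scale):
    """Convert an average of adjusted values back to the original units."""
    return assumed_origin + coded_mean / scale


def decode_sd(coded_sd, scale):
    """A standard deviation is affected only by the scale, not by the origin."""
    return coded_sd / scale


mean_a = decode_mean(3.19, 55, 10)   # rainfall average
mean_b = decode_mean(1.19, 40, 10)   # run-off average
sd_a = decode_sd(64.8, 10)           # 64.8 is inferred, not quoted in the text

slope = 0.92 * 5.61 / 6.48
intercept = mean_b - slope * mean_a
runoff_at_60 = slope * 60 + intercept

print(round(mean_a, 2), round(mean_b, 2))  # 55.32 40.12
print(round(runoff_at_60, 1))              # 43.8
```

Carrying the slope unrounded gives an intercept near −3.94 rather than −4.14, because the hand calculation rounds the slope to 0.8 before multiplying by the average rainfall. Within the range of the data the two versions give almost the same estimates (about 43.85 against 43.86 at a = 60).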


These two points have been plotted on Fig. 34 and the regression
line drawn along with the plotted values for each of the sixteen pairs
of observations.

Figure 34. Regression line and confidence limits for the assessment of annual
run-off from annual rainfall for the River Etherow (labelled axis: annual run-off (b))

If such a graph were to be used for assessing the probable run-off 
for given annual rainfalls, it would be desirable to indicate the range 
within which actual conditions are likely to occur with a given 
probability. In other words, it is desirable to enter on the graph the 
‘confidence limits’ for such an assessment. These are obtained in the 
following way:

Standard error of the estimate of b:

$$S_b = \sigma_b \sqrt{1 - r^2}$$

i.e.

$$S_b = 5.61 \sqrt{1 - 0.92^2} = 5.61 \sqrt{1 - 0.85} = 5.61 \sqrt{0.15} = 5.61 \times 0.38 = 2.13$$

Then $2 S_b = 4.26$ and $3 S_b = 6.39$


Therefore in relation to the two points from which the regression 
line is drawn, the following are the various limits of the confidence 
lines. 


                             68% prob.        95% prob.        99.7% prob.
If a = 60       then b =   41.7 to 46.0     39.6 to 48.1     37.5 to 50.25
and if a = 55.3 then b =   38.0 to 42.25    35.9 to 44.4     33.7 to 46.5
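
The three sets of limits in this table follow mechanically from the two estimates and the standard error. The sketch below uses an assumed helper name, `limits_at`, and takes the standard error as rounded in the text (2.13).

```python
def limits_at(estimate, standard_error, multiples=(1, 2, 3)):
    """Return (lower, upper) pairs at 1, 2 and 3 standard errors from the line."""
    return [(estimate - k * standard_error, estimate + k * standard_error)
            for k in multiples]


standard_error_b = 2.13  # value of Sb as rounded in the text

# The two points used to draw the regression line: (rainfall, estimated run-off)
for rainfall, runoff in [(60, 43.86), (55.32, 40.12)]:
    bands = limits_at(runoff, standard_error_b)
    print(rainfall, [(round(lo, 2), round(hi, 2)) for lo, hi in bands])
```

If the square root is carried without intermediate rounding the standard error is nearer 2.20, so limits computed that way would be slightly wider than those tabulated.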


These several values have also been entered on Fig. 34, thus giving 
three sets of confidence limits for this assessment of run-off from 
rainfall data. In this way a guide is given not only to the probable 
run-off from the catchment area but also to the likelihood with 
which such values will occur. These can be read off from Fig. 34, 
but it must be remembered that such values will only hold true if the 
straight-line relationship postulated applies for all rainfall ranges. 
There is here the possibility that as rainfall reaches very high values, 
e.g. about 80 in., then run-off values may deviate from such an 
hypothetical relationship. This is always a problem with regression 
lines, and it is only safe to apply them to the ranges of values on 
which the calculations are based. In this case the regression line 
and confidence limits should be satisfactory for falls at least between 
40” and 65” and almost certainly between 35” and 70”, i.e. within
the likely range of values. Exceptional falls, whether they be high or 
low, may not be so adequately interpreted.


Straight-line Regression for One Variable


The calculation of a regression line between two variables which 
have some correlation with one another is thus a fairly simple 
operation, the formula presented and used above effecting a ‘least
squares’ fit of the regression line to the data with a minimum of 
labour. In many problems for which a regression line would be 
useful, however, two correlated variables are not involved. Rather 
the data consist of only one variable, the occurrences of which are 
available for some regular interval either in space or time. The prob- 
lem here is to construct a regression line that will express the relation- 
ship between changing location (in space or time) and changing 
magnitude of the occurrence. Thus if in the earlier example of crop 
yields varying with altitude the crop data had been obtained every 
100 ft. instead of at irregular intervals, then a regression line of 
yields with altitude could have been obtained without first calculating 
the correlation coefficient. Again, in Chapters 2 and 3 several of the 
examples were based on either iron-ore production of four countries 
over a period of twenty years, or annual rainfall values at Bidston for a
thirty-year period. In both cases these represent data for one variable, 
the values showing conditions at regular intervals. Also in both cases 
a semi-regular change of values with time could be expected as a 
distinct possibility, and such a change (if it does really exist) can be 
represented by a regression line. This theme here impinges on that 
of trends and fluctuations which will be considered at greater length 
in Chapter 13. Therefore, although the methods of calculating such a 
regression line will be outlined here, the implications of such a line 
in terms of trends will be left for consideration in the next chapter. 

In calculating a regression line for data of this sort, two assump- 
tions must be made. The first is that any relationship that exists holds 
true over the whole period or distance, while the second is that the 
relationship can be represented by one specific type of curve or line. 
It is therefore necessary to postulate, for example, that the most 
likely relationship is that which is represented by a straight line; in 
other cases, the exponential curve (p. 203) may be assumed to give the 
best fit to observed conditions. Whichever curve is assumed will 
control not only the calculations but also the conclusions that are 
likely to be drawn from the resulting graph. This fact must always be 
borne in mind.

If the rainfall data for Bidston, tabulated in Chapter 2 (p. 11), are
used for the first example of this method, then it would seem reason- 
able to assume that if there is any change of values with time it may 
well approximate to a straight-line curve. This is not to argue that 
any such change will necessarily take place at a uniform rate through- 
out the period, but simply that as an idealized curve it is likely to be 
reasonably close to reality. In calculating such a linear relationship 
the aim is to assess the number of units by which the variable (in this 
case, rainfall) changes for each unit change of the time or distance 
factor (in this case, successive years). As there is no steady functional 
relationship between time and rainfall, such a study can only provide 
an assessment, and again the regression line is drawn so as to ensure 
that the sum of the squares of the differences of the actual rainfall 
values from this line is at a minimum, i.e. it is based on the ‘least 
squares’ method again. 

If the years involved are listed under (a) and the appropriate 
rainfall under (b), then the number of units (y) which b will increase
per unit increase of a will be obtained by the formula:

$$y = \frac{\sum (a - \bar{a})(b - \bar{b})}{\sum (a - \bar{a})^2}$$


This formula will ensure that the resulting regression line will fit the 
‘least squares’ requirement. If this formula is considered a little more 
carefully it will be seen that it has some points in common with the 
calculation of the product moment correlation coefficient (p. 169). 
What the formula implies is that if the difference of rainfall values
from the rainfall average was unit for unit the same as the difference
of the occurrence number from the average of the occurrence
numbers, then $(a - \bar{a})$ would be the same as $(b - \bar{b})$. In such a case
the expression $(a - \bar{a})(b - \bar{b})$ would be the same as $(a - \bar{a})^2$ so that
the value of (y) in the above formula would be unity. This means
that the amount by which the value of (y) differs from unity is controlled
by the values of $(b - \bar{b})$. If these are larger than $(a - \bar{a})$ then
(y) will be more than unity, while if they are smaller than $(a - \bar{a})$
then (y) will be less than unity. In this way it can be seen that
the value of (y) is based on the relationship of the sum of the
squares of $(a - \bar{a})$ and the sum of the products of $(a - \bar{a})$ and
$(b - \bar{b})$. Thus the resulting regression line is located so that the sum
of the squares of $(b - \bar{b})$ is kept to the minimum.

The calculation of the necessary values could be done directly
from the raw data, but this would involve the prior computation
of the average values and also working would have to be done
with large numbers. Once again, therefore, a shorter method of
calculation is introduced by means of an assumed average value. The
$(a - \bar{a})$ and $(b - \bar{b})$ values are therefore calculated in terms of these
assumed averages, and corrections for the errors thus introduced are
incorporated into the formula. The resulting calculations are set out
in Table XXIII, and are explained below. It will be seen that the a
values, i.e. the years, are simply listed as 1-30, rather than the actual
dates themselves. This greatly cuts the size of the values involved.
Also, in this particular case, the rainfall values are given in whole
numbers of inches only. This is not part of the standard shortening
method, but has simply been adopted here to facilitate ready
computation.


Table XXIII


Calculation of a regression line for annual rainfall at Bidston for
the period 1901-1930


Years   Rainfall   (a − 15)   (b − 28)
  a        b          q          t          qt        q²
  1       25        −14         −3         +42       196
  2       26        −13         −2         +26       169
  3       34        −12         +6         −72       144
  4       25        −11         −3         +33       121
  5       24        −10         −4         +40       100
  6       28         −9          0           0        81
  7       27         −8         −1          +8        64
  8       29         −7         +1          −7        49
  9       28         −6          0           0        36
 10       29         −5         +1          −5        25
 11       25         −4         −3         +12        16
 12       30         −3         +2          −6         9
 13       26         −2         −2          +4         4
 14       26         −1         −2          +2         1
 15       27          0         −1           0         0
 16       25         +1         −3          −3         1
 17       31         +2         +3          +6         4
 18       32         +3         +4         +12         9
 19       29         +4         +1          +4        16
 20       33         +5         +5         +25        25
 21       22         +6         −6         −36        36
 22       26         +7         −2         −14        49
 23       31         +8         +3         +24        64
 24       33         +9         +5         +45        81
 25       28        +10          0           0       100
 26       29        +11         +1         +11       121
 27       35        +12         +7         +84       144
 28       29        +13         +1         +13       169
 29       25        +14         −3         −42       196
 30       36        +15         +8        +120       225

                    +15        +13        +326    +2,255
                     Σq         Σt         Σqt       Σq²

Regression coefficient of t on q, i.e.

$$y = \frac{\sum qt - \dfrac{\sum q \cdot \sum t}{n}}{\sum q^2 - \dfrac{(\sum q)^2}{n}} = \frac{326 - \dfrac{15 \times 13}{30}}{2{,}255 - \dfrac{15 \times 15}{30}} = \frac{326 - 6.5}{2{,}255 - 7.5} = \frac{319.5}{2{,}247.5} = +0.142$$

$$\bar{a} = \bar{q} + 15 = \frac{\sum q}{n} + 15 = \frac{15}{30} + 15 = 15.5$$

$$\bar{b} = \bar{t} + 28 = \frac{\sum t}{n} + 28 = \frac{13}{30} + 28 = 28.43$$

The first need is to adopt assumed averages for the two sets of 
values (a and b). These are taken to be 15 for column a and 28 for
column b, so that the differences from these assumed averages are
given in the third and fourth columns of the table. They are indicated
by (q) and (t), such that (q) represents the difference between the (a)
values and the assumed average (15), while (t) represents the difference
between the (b) values and the assumed average (28). If $(a - \bar{a})$
is now represented by (q) and $(b - \bar{b})$ by (t), in both cases ignoring
the fact that the average is only an assumed one, then the formula
for (y) becomes

$$y = \frac{\sum qt}{\sum q^2}$$

Therefore in Table XXIII these two values have been calculated in
the fifth and sixth columns. However, the assumed nature of the
average cannot be ignored, and a correction has to be applied to
each of these two values to remove any influence resulting from the
difference between the assumed and actual average values.

The formula for (y) must therefore be written as 


$$y = \frac{\sum qt - \dfrac{\sum q \cdot \sum t}{n}}{\sum q^2 - \dfrac{(\sum q)^2}{n}}$$


In each case, if the assumed mean happened to be the same as the 
actual mean, then this correction would be 0. 
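
The origin of the two correction terms can be shown in a few lines. This is a sketch only: the symbols A and B for the assumed averages of a and b are introduced here for illustration and are not the book's notation.

```latex
% Differences from the assumed averages A and B
q = a - A, \qquad t = b - B
\quad\Longrightarrow\quad
\bar{a} = A + \frac{\sum q}{n}, \qquad \bar{b} = B + \frac{\sum t}{n}

% Hence the true deviations are the adjusted values less their own averages
a - \bar{a} = q - \frac{\sum q}{n}, \qquad b - \bar{b} = t - \frac{\sum t}{n}

% Multiplying out and summing over the n pairs of values
\sum (a - \bar{a})(b - \bar{b}) = \sum qt - \frac{\sum q \cdot \sum t}{n},
\qquad
\sum (a - \bar{a})^2 = \sum q^2 - \frac{(\sum q)^2}{n}
```

When the assumed average of either column happens to equal its true average, the corresponding sum (of q or of t) is zero, and the correction to the product term disappears.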

The values for the present example can now be inserted in this 
formula 


$$y = \frac{326 - \dfrac{15 \times 13}{30}}{2{,}255 - \dfrac{15^2}{30}} = \frac{326 - 6.5}{2{,}255 - 7.5} = \frac{319.5}{2{,}247.5} = +0.142$$


This means that for every unit change of (a), i.e. for each year’s
change, there will be a +0.142 unit change of (b), i.e. a change of
+0.142 inches.
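
The shortened calculation can be checked against the direct formula using the rainfall values of Table XXIII. The sketch below is illustrative only; the function names `slope_direct` and `slope_assumed_means` are assumptions introduced here.

```python
# Annual rainfall at Bidston (whole inches), years 1 to 30, from Table XXIII
rainfall = [25, 26, 34, 25, 24, 28, 27, 29, 28, 29,
            25, 30, 26, 26, 27, 25, 31, 32, 29, 33,
            22, 26, 31, 33, 28, 29, 35, 29, 25, 36]
years = list(range(1, 31))


def slope_direct(a, b):
    """Regression coefficient using deviations from the true averages."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    products = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    squares = sum((x - mean_a) ** 2 for x in a)
    return products / squares


def slope_assumed_means(a, b, assumed_a, assumed_b):
    """Same coefficient using assumed averages plus the correction terms."""
    n = len(a)
    q = [x - assumed_a for x in a]
    t = [y - assumed_b for y in b]
    sum_q, sum_t = sum(q), sum(t)
    sum_qt = sum(qi * ti for qi, ti in zip(q, t))
    sum_q2 = sum(qi * qi for qi in q)
    return (sum_qt - sum_q * sum_t / n) / (sum_q2 - sum_q ** 2 / n)


print(round(slope_assumed_means(years, rainfall, 15, 28), 3))  # 0.142
print(round(slope_direct(years, rainfall), 3))                 # 0.142
```

Any other pair of assumed averages gives the same coefficient, since the correction terms remove whatever offset is chosen.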

From this the two points required for the drawing of the regression 
line are easily obtained. One point is provided by the two average 
values. The calculation of these is set out in Table XXIII, where it
can be seen that the sum of the q or t columns (whichever is being
considered) is divided by the number of occurrences, and this value
is then corrected by the value of the assumed mean. Thus it can be
seen that the average value of column a is 15.5, while for column b
(rainfall) it is 28.43. To obtain the second point a simple substitution
of values is effected. So if a = 25.5 (i.e. $\bar{a} + 10$), then $b = \bar{b} + 10y$.
In the present case this becomes

$$b = 28.43 + (10 \times 0.142) = 28.43 + 1.42 = 29.85$$

From these two points

$$a = 15.5 \text{ and } b = 28.43$$

$$a = 25.5 \text{ and } b = 29.85$$

the regression line has been drawn in Fig. 35. This line suggests that 
throughout the period under review (1901-1930), a slight overall
increase in rainfall has occurred, though actual values differ quite
markedly from the idealized values of the regression line.

Figure 35. Regression line and confidence limits of annual rainfall at
Bidston for the period 1901-1930 (axes: (a) years after 1900 against
(b) annual rainfall in inches)

This theme of the differences between actual and idealized values will be
considered further in Chapter 13. Finally, the expression for the
regression of annual rainfall at Bidston for this period can be readily
calculated as follows:


$$b - \bar{b} = y(a - \bar{a})$$

(this follows from the calculation of y, the regression coefficient).

$$b - 28.43 = 0.142(a - 15.5)$$

$$b = 0.142a - 2.2 + 28.43$$

$$b = 0.142a + 26.23$$


This is also entered on Fig. 35, and the second point of the regression 
line could equally have been obtained by substitution in this formula. 


Straight-line Regression for Spatial Change 


This method of calculating a straight-line regression can also be 
applied to data which represent changes in space, with the observa- 
tions taken at regular intervals. Suppose, for example, that remnants 
of a former cliff-line have been plotted over a considerable north- 
south extent of a westward-facing coastline. There are reasons to 
suppose that this area has been warped to a slight extent. However, 
the available data are in a variety of situations, some being at former 
headlands, others in bays or along estuaries, while the rocks on which 
they exist are themselves of varied character. As a result the heights 
of the cliff bases do not clearly indicate whether warping has taken
place or not, apparently fluctuating indiscriminately. If the cliff-foot 
heights are available every mile in a straight north-south direction it 
would, however, be possible to calculate the regression line between 
these heights and distance, so that the possible trend can be seen. 
The postulated data are set out in Table XXIV, and the necessary 
calculations are there presented. From these it can be seen that the 
regression coefficient y = -0.1105, i.e. that for every unit of
distance (1 mile) southwards the cliff-foot heights decrease by
0.1105 ft. From this the regression formula is

b = 26.11 - 0.1105a

and this regression line is shown in Fig. 36.


Figure 36. Regression line of cliff-foot heights on distance from north to south
(horizontal axis: distance in miles, N to S; vertical axis: cliff-foot heights in feet)


It would also have been possible to obtain a regression line by the 
methods outlined at the beginning of this chapter. Thus a correlation 
coefficient could have been calculated between height and distance, 
and the regression of height on distance obtained. This would neces- 
sarily have involved much more calculation, though several advan- 
tages would have accrued from the results of this extra work. The 
correlation coefficient would have been r = -0.371 and this could
then be tested for significance. With a sample of only 20 values this
does not reach the 5% level of significance, but if the sample were to
be increased to 30 and the same degree of correlation held true, then
this 5% level of significance would apply. The regression formula
would be almost the same as in Fig. 36, i.e. b = 26.13 - 0.112a, and
various confidence limits could also have been calculated. This is not 
so with the present method when only a rough guide can be provided
by inserting limits at 25% of the regression line value above and
below the regression line itself (see Fig. 36 for an example of this). 
The accuracy of the values quoted here can be checked by the reader
from the data and methods presented earlier. The whole comparison
stresses the fact that the type and value of the data that can be
obtained by statistical analysis depend on the methods used and the
amount of work put into the analysis. As a working rule the more
complete the analysis, the more varied and reliable is the information
that is obtained.


Table XXIV

Calculation of a regression line for cliff-foot heights upon
north-south distance along a westward-facing coast

Distance in     Height above
1 mile units    m.s.l. of
from N-S        cliff-foot     (a - 10)    (b - 25)
     a               b             q           t          qt        q²
     1              25            -9           0           0        81
     2              28            -8          +3         -24        64
     3              24            -7          -1          +7        49
     4              26            -6          +1          -6        36
     5              28            -5          +3         -15        25
     6              23            -4          -2          +8        16
     7              25            -3           0           0         9
     8              25            -2           0           0         4
     9              26            -1          +1          -1         1
    10              23             0          -2           0         0
    11              27            +1          +2          +2         1
    12              25            +2           0           0         4
    13              28            +3          +3          +9         9
    14              22            +4          -3         -12        16
    15              24            +5          -1          -5        25
    16              23            +6          -2         -12        36
    17              25            +7           0           0        49
    18              23            +8          -2         -16        64
    19              24            +9          -1          -9        81
    20              25           +10           0           0       100
                                 +10          -1         -74      +670
                                  Σq          Σt         Σqt       Σq²

Regression coefficient of t(b) on q(a), i.e.

y = (Σqt - Σq.Σt/n) / (Σq² - (Σq)²/n)
  = (-74 - (10 x -1)/20) / (670 - 100/20)
  = (-74 + 0.5) / (670 - 5) = -73.5/665
  = -0.1105

Average values:

ā = q̄ + 10 = Σq/n + 10 = 10/20 + 10 = 10.5

b̄ = t̄ + 25 = Σt/n + 25 = -1/20 + 25 = 25 - 0.05 = 24.95

Regression equation:

b - b̄ = y(a - ā)

b = y(a - ā) + b̄ = -0.1105(a - 10.5) + 24.95
  = -0.1105a + 1.16 + 24.95

b = 26.11 - 0.1105a

Points for regression line:

When a = 10.5 then b = 24.95
When a = 20 then b = 23.9
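
For readers who wish to check the arithmetic of Table XXIV mechanically, the assumed-mean calculation can be sketched in a few lines of Python. This is only an illustrative sketch: the function name and variable names are hypothetical, and the heights are those read from Table XXIV (assumed means of 10 for distance and 25 for height).

```python
def assumed_mean_regression(a, b, a0, b0):
    """Straight-line regression of b on a, worked from assumed means a0 and b0."""
    n = len(a)
    q = [x - a0 for x in a]          # deviations of a from its assumed mean
    t = [y - b0 for y in b]          # deviations of b from its assumed mean
    sum_q, sum_t = sum(q), sum(t)
    sum_qt = sum(qi * ti for qi, ti in zip(q, t))
    sum_q2 = sum(qi * qi for qi in q)
    # Regression coefficient y = (Sqt - Sq.St/n) / (Sq2 - (Sq)^2/n)
    slope = (sum_qt - sum_q * sum_t / n) / (sum_q2 - sum_q ** 2 / n)
    mean_a = sum_q / n + a0          # true means recovered from assumed means
    mean_b = sum_t / n + b0
    intercept = mean_b - slope * mean_a
    return slope, intercept, mean_a, mean_b


# Cliff-foot heights (feet) at 1-mile intervals, as read from Table XXIV.
distance = list(range(1, 21))
height = [25, 28, 24, 26, 28, 23, 25, 25, 26, 23,
          27, 25, 28, 22, 24, 23, 25, 23, 24, 25]

slope, intercept, mean_a, mean_b = assumed_mean_regression(distance, height, 10, 25)
print(f"b = {intercept:.2f} + ({slope:.4f})a")
```

The choice of assumed means does not affect the result; it merely keeps the hand arithmetic small, which is why the same function gives the same line whatever values of a0 and b0 are supplied.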


The Exponential Curve 


In all the examples so far analysed in this chapter the basic assump- 
tion has been that the form of the relationship between the data 
approximates to a straight line. This, however, is not always so. In 
studies of population data, for example, it must always be remem- 
bered that the size of the population at one moment in time will 
affect its size at some later moment in time, just as it has itself been 
affected by the size of the population at some earlier period. As a 
result, population values do not always increase from one period to 
another by a uniform and constant amount. Instead, they often tend 
to increase by a uniform and constant rate. Thus the change with 
time is not arithmetic (as has been the assumption in earlier examples) 
but is rather geometric. In this way the increase is not expressed in 
such terms as ‘a 10,000 increase per half-century’ but rather as ‘a two- 
fold increase per half-century’. Moreover, there may often be at least 
an element of this geometric increase in the case of industrial
production values, while there is also frequently a geometric relationship
between distance along the long-profile of a river and change of alti- 
tude. Before leaving this theme of regression lines it is therefore 
essential that the calculation of such lines on the assumption of a 
geometric relationship be considered, for clearly they are of direct 
application to many problems of geographical interest. Lines such 
as these which have been suggested here are referred to as exponential 
curves, and they are frequently considered as representing a natural 
law of growth by which existing conditions are assumed to affect 
those in the future. 

To illustrate the characteristics of data which fit the exponential 
curve the following simple example provides a suitable starting point. 
If four numbers, set out in succession, are 


1     3     9     27


it is clear that there is a threefold rate of increase from one value to 
the next, i.e. that there is a common rate of increase as distinct from 
a common amount of increase. With a more complex set of values 
this may not be appreciated so easily, especially if the rate of increase 
were not in terms of whole numbers. It is true that the relationship 
even in such a case could be arrived at by trial and error, but this is 
extremely slow and laborious with no guarantee of success. The diffi- 
culty partly arises for the very reason that the absolute difference 
between adjacent values is never constant—thus, in the present simple 
case these differences are 2; 6; 18. What happens, on the other hand, 
if the values involved are changed to logarithms? They then assume 
the values given below: 


Original                      Difference between
 value       Logarithm        successive logarithms
    1         0.00000
                                   0.47712
    3         0.47712
                                   0.47712
    9         0.95424
                                   0.47712
   27         1.43136


Clearly, once the logarithms are considered instead of the original 
values, a constant amount of change is introduced again. 
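
The effect of taking logarithms can be verified directly. The following short Python sketch (variable names are illustrative only) computes the successive differences of the four values and of their common logarithms.

```python
import math

values = [1, 3, 9, 27]
logs = [math.log10(v) for v in values]

# Differences between adjacent original values: these are not constant.
differences = [later - earlier for earlier, later in zip(values, values[1:])]

# Differences between adjacent logarithms: these are constant (log 3).
log_differences = [later - earlier for earlier, later in zip(logs, logs[1:])]

print(differences)
print([round(d, 5) for d in log_differences])
```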

Once this has been done it is possible to calculate the necessary 
regression line by the same formula as before (pp. 196-200), using the 
logarithms of the values instead of the values themselves. Hence the
resulting curve is often referred to as a ‘logarithmic curve’. This is a 
perfectly legitimate device, but it does mean that care must be taken 
to interpret the results aright. As an illustration of the method the 
values given above, which are known to fit the exponential curve 
perfectly, can be examined, tabulating them as follows: 


                    Log. of
Items    Values     values      (a - 3)   (log b - 1)
  a        b         log b         q           t           qt        q²
  1        1        0.00000       -2       -1.00000    +2.00000      4
  2        3        0.47712       -1       -0.52288    +0.52288      1
  3        9        0.95424        0       -0.04576     0            0
  4       27        1.43136       +1       +0.43136    +0.43136      1
                                  -2       -1.13728    +2.95424      6
                                  Σq          Σt          Σqt       Σq²

Regression coefficient = log y (i.e. the log increase of b per unit
increase of a)

log y = (Σqt - Σq.Σt/n) / (Σq² - (Σq)²/n)
      = (2.95424 - (-2 x -1.13728)/4) / (6 - (-2)²/4)
      = (2.95424 - 0.56864) / (6 - 1) = 2.38560/5 = +0.47712


Thus the same answer is obtained as in the simple tabulation, so it 
can be appreciated that this method yields the correct answers. A full 
application is probably best done in connection with a specific ex- 
ample in which the values only approximate to, and do not perfectly 
fit, the exponential curve. 

Suppose that a study were being made of the colonization of an 
area of tidal flats by some particular plant species. A given section 
of that tidal area may be studied over a period of years, and the 
number of plants of the specific type occurring there is counted each 
year. In the first year the species has only just begun to colonize the 
area, and only three plants were to be seen. With natural regenera- 
tion, however, the numbers increase steadily, the values counted for 
the first six years of the study being as given in the table below.
Clearly this increase is not linear, and considering the type of pheno- 
menon being studied it may be expected that the exponential curve 
(of natural growth) would provide a regression line which would fit 
the data more closely. The calculations, using the logarithms of the 
values for the number of plants, are set out in Table XXV. 


Table XXV

Calculation of the exponential curve for the increase of plants over
an area of tidal flats

          No. of
Years     plants                (a - 3)   (log b - 1.5)
  a         b        log b         q           t            qt        q²
  1         3       0.47712       -2       -1.02288     +2.04576      4
  2         8       0.90309       -1       -0.59691     +0.59691      1
  3        25       1.39794        0       -0.10206      0            0
  4        80       1.90309       +1       +0.40309     +0.40309      1
  5       250       2.39794       +2       +0.89794     +1.79588      4
  6       700       2.84510       +3       +1.34510     +4.03530      9
                                  +3       +0.92428     +8.87694     19
                                  Σq          Σt           Σqt       Σq²

Regression coefficient = log y
      = (Σqt - Σq.Σt/n) / (Σq² - (Σq)²/n)
      = (8.87694 - (3 x 0.92428)/6) / (19 - 3²/6)
      = (8.87694 - 0.46214) / (19 - 1.5) = 8.41480/17.5
      = +0.48

Average values:

ā = q̄ + 3 = Σq/n + 3 = 3/6 + 3 = 3.5

log b̄ = t̄ + 1.5 = Σt/n + 1.5 = 0.92428/6 + 1.5 = 1.654
(where log b̄ denotes the mean of the log b values)

From these values it is then possible to calculate both the regression
equation and the necessary points to draw the regression line. Once 
again, these follow the same form as before, save that the logarithmic 
values are used. 


Regression equation: 

log b - log b̄ = log y(a - ā)

log b = log y(a - ā) + log b̄
      = 0.48(a - 3.5) + 1.654
      = 0.48a - 1.68 + 1.654

log b = 0.48a - 0.026


As for the points from which to draw the regression line, the number
of such points that are required depends on whether ordinary graph
paper or semi-logarithmic graph paper is being used. In the first case,
all of the values of a from 1 to 6 must be substituted in this formula
in turn, so that each point is calculated. This will yield a curved line
as is shown in Fig. 37a. If semi-logarithmic graph paper is being used,
however, only two points are needed, of which one is provided by the
two average values already calculated. This is because the construction
of the graph paper ensures that a line showing a constant rate of
increase (i.e. the exponential curve) will be plotted and drawn as a
straight line (Fig. 37b). These two values could thus be

(i) when a = 3.5 then log b = 1.654 (and b = 45.08)

(ii) when a = 6 then log b = 2.854 (and b = 714.50)

Figure 37. Regression line (exponential curve) on ordinary and on
semi-logarithmic graph paper
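
The whole calculation for the plant counts may be checked with a short Python sketch. It is offered only as an illustration: the function name is hypothetical, and it works from the true means rather than from assumed means, which is algebraically equivalent to the tabulated method. Because it carries the regression coefficient unrounded (about 0.4808 rather than 0.48), the fitted values differ slightly from those obtained with the rounded formula.

```python
import math


def fit_exponential(a, b):
    """Least-squares line fitted to (a, log10 b); returns log y and both means."""
    n = len(a)
    log_b = [math.log10(v) for v in b]
    mean_a = sum(a) / n
    mean_log_b = sum(log_b) / n
    s_ab = sum((x - mean_a) * (y - mean_log_b) for x, y in zip(a, log_b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    return s_ab / s_aa, mean_a, mean_log_b


years = [1, 2, 3, 4, 5, 6]
plants = [3, 8, 25, 80, 250, 700]

log_y, mean_a, mean_log_b = fit_exponential(years, plants)

# Point (i): the curve passes through the two average values.
b_at_mean = 10 ** mean_log_b

# Points needed for ordinary graph paper: one fitted value for each year.
fitted = [10 ** (mean_log_b + log_y * (x - mean_a)) for x in years]

print(round(log_y, 2), mean_a, round(mean_log_b, 3), round(b_at_mean, 2))
```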

Thus three types of regression lines have been presented in this 
chapter. The first was a straight line related to two variables, between 
which some degree of correlation had already been established. Con- 
fidence limits could also be included with some precision, while the 
significance of the relationship was capable of definition from the 
correlation coefficient. The other two regression lines were concerned 
with but one variable, the values of which were available at constant 
intervals of distance or time. Which of the two is to be used in any 
particular case must be decided by prior consideration of the data, 
the one chosen being that which most closely fits the data. The two 
forms used here were straight and exponential regression lines— 
others of greater complexity should be studied from more advanced 
texts if that is so desired. In all these cases, however, the specific 
purpose of the regression line is to express the relationship between 
data and location (or data and data) as precisely as possible, always 
bearing in mind the fact that there is not an absolute and functional 
relationship between them. The regression line provides the closest 
fit, based upon the ‘least squares’ approach. Once prepared, it repre- 
sents the relationship that exists in terms of the available observations, 
and it thus provides an illustration of relationships as they have 
existed or do exist. Prognostication may be carried out on the basis 
of such lines, but there is no necessary statistical reason why they 
should apply outside the data on which they are based. If the nature 
of the phenomenon under study renders this likely, however, e.g. in 
terms of rainfall and run-off, then these regression lines acquire a yet 
greater value and significance. Such considerations, which are related 
to trends and fluctuations, especially over time, are considered some- 
what more fully in the following chapter. 
CHAPTER 13


FLUCTUATIONS AND TRENDS 


In all the problems which have so far been considered in this book, 
the aim has been to reduce or eliminate the detailed differences be- 
tween one particular value and another, so that the overall character- 
istics can more readily be appreciated. Even in the case of correlation, 
where individual values were more directly considered, the purpose 
was to obtain one index which would summarize the full set of indi- 
vidual relationships. With some problems, however, the geographer 
must necessarily concern himself with the details of the changes from 
one individual value to another. This is so when the data consist of 
values which change in relation to changes in the time-scale. Thus it 
is possible that changes in production, in climatic conditions or in 
population values bear some relationship to such time-scale changes. 
This has already been partially indicated in the previous chapter, but 
even there the purpose of the regression lines was to present the 
overall change rather than the details of the actual changes. 


The Simple Graph 


When considering such details of change with time, i.e. when the 
fluctuations of a given set of values are being analysed, it is necessary 
to have recourse to graphical representation. If such fluctuations 
were found to occur with a clearly definable regularity then it would 
be possible to represent this by some mathematical expression. If, 
however, the fluctuations are of an irregular nature then such a 
mathematical summary can only be made at the expense of detail, 
and graphical illustration can give a clearer picture of conditions. 

The simplest method of showing fluctuations is by means of a 
graph in which values of the phenomenon concerned are plotted 
against time and then these points joined by a continuous line. Such 
graphs are shown in Fig. 38 and Fig. 39. The former is for the annual 
rainfall data for Bidston, various characteristics of which have been 
assessed previously, while the second is for the output of crude 
petroleum by the U.S.A. for the twenty years 1937-1956, the values 
for which are set out in Table XXVI. 
Figure 38. Fluctuation in annual rainfall at Bidston, 1901-1930

Figure 39. Fluctuation in U.S.A. crude petroleum production, 1937-1956


The pattern of change with time of petroleum production is readily 
apparent, the curve being almost universally upwards save on four 
occasions. Each of these falls lasted for only one year, and the picture 
as a whole is both uncomplicated and readily appreciated. On the 
graph of rainfall, however, this simplicity no longer applies. Values 
increase and decrease with apparent irregularity, and the definition 
of periods of rising or falling values, or of spells of wetter or drier 
years, becomes increasingly subjective. Moreover, if an attempt were
to be made to compare such a graph of fluctuations with a similar
graph for some other station, it would prove exceedingly difficult to
pass any worthwhile judgment on whether or not the details of the
fluctuations bore any relationship one to the other.


Table XXVI

U.S.A. crude petroleum production, 1937-1956, in millions of metric
tons

Year    Production        Year    Production
1937       173            1947       251
1938       164            1948       273
1939       171            1949       249
1940       183            1950       267
1941       189            1951       304
1942       187            1952       309
1943       203            1953       319
1944       227            1954       313
1945       232            1955       336
1946       234            1956       354


Running Means 


With difficult cases such as this, there are other devices that may 
be used to simplify the task of judgment and assessment. The first 
of these aims at smoothing out the sharp and marked irregularities 
that can be seen in Figs. 38 and 39 so that only the major fluctuations 
are stressed and so need be considered. This can be effected by the 
calculation of ‘running means’. This implies that if ‘five-year running 
means’ are being used, for example, then the first value will be the 
average of years 1-5; the second value will be the average of years 
2-6; the third value will be the average of years 3-7 etc., until the 
final five years of the period. For the Bidston data the first two values 
would be as follows: 


                     First five-year    Second five-year
Years    Rainfall         mean               mean
1901      25.19
1902      25.57
1903      34.42          26.87
1904      25.18                             27.45
1905      24.01
1906      28.08
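
The procedure lends itself to a brief Python sketch, given here purely as an illustration (the function name is hypothetical). It uses only the six Bidston values quoted above, and each mean would be plotted against the middle year of its period.

```python
def running_means(values, window):
    """Average of each successive run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Annual rainfall (inches) at Bidston, 1901-1906, as quoted above.
rainfall_1901_1906 = [25.19, 25.57, 34.42, 25.18, 24.01, 28.08]

means = running_means(rainfall_1901_1906, 5)
print([round(m, 2) for m in means])
```

With n values and a period of k years the method yields n - k + 1 means, so that a thirty-year record gives 26 five-year means but only 21 ten-year means.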


Any number of years may be the basis for such a smoothing tech- 
nique, but it must be borne in mind that if there were to be a regular 
periodicity in the fluctuations of the same length as the running-mean
period, then such a regular fluctuation would not appear in the resul- 
tant graphs. It is therefore usually desirable to prepare such graphs 
for two periods of different lengths. These Bidston data have there- 
fore been changed into both ‘five-year’ and ‘ten-year’ running means, 
and the respective graphs are shown in Fig. 40. In both, an overall 
though interrupted increase in rainfall values is indicated. On the 
basis of the five-year periods, values seem markedly to increase after 
the period 1913-1917 (mid-year 1915), although smaller fluctuations 
are seen to occur both before and after this period. From the values 
for the ten-year periods it would seem that rainfall increased after
the decade 1913-1922, or possibly after 1904-1913, to a maximum 
in 1918-1927, while again smaller fluctuations are also apparent. 
Such differences as these are, however, really differences between 
sample means. Therefore before any further reasoning or conclusions 
are based on these apparent differences, they should be tested by the 
methods outlined in Chapter 8 to assess whether they could well have 
occurred by chance, or whether they are statistically significant. One
possibility is to use the ‘standard error of the difference’ test, but as
the size of the samples is relatively small it is better to apply Student’s
t Test. Thus in the case of the ten-year running means it would be
desirable to test whether the difference between the driest decade
(1904-1913) and the wettest decade (1918-1927) is statistically
significant or not.

Figure 40. Graphs of running means for annual rainfall at Bidston, 1901-1930

The basic parameters of average and standard deviation required
for the application of Student’s t Test can be calculated from the data
in Table I (p. 11). They are as follows:

                                     Sample
                        Sample       standard
      Decade            average      deviation
  (a) 1904-1913          27.1          1.92
  (b) 1918-1927          29.8          3.66

From these, Student’s t can be calculated thus:

t = |x̄_a - x̄_b| / √( s_a²/(n_a - 1) + s_b²/(n_b - 1) )

  = |27.1 - 29.8| / √( 3.69/9 + 13.34/9 )

  = 2.7 / √(0.41 + 1.48) = 2.7/1.375 = 1.96

The degrees of freedom are
(n_a + n_b - 2) = 10 + 10 - 2 = 18

and by reference to Fig. 27 it can be seen that these values do not 
quite reach the 5% level. Thus there is a probability of just more than 
5% that a difference as great as this could have occurred by chance, 
so that it is not fully justified to argue that this difference is a prob- 
ably significant one. However, if conditions at neighbouring stations 
indicated a change over the same period that was statistically signifi- 
cant, then it would be reasonable to treat a case such as this as falling 
in the same category—though still maintaining a certain element of 
possible doubt. 
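
The calculation of t for the two decades may be sketched in Python as follows. This is an illustration under stated assumptions: the function name is hypothetical, the 1904-1913 average is taken as 27.1 (the value implied by the difference of 2.7 used in the calculation), and the critical value of 2.101 for 18 degrees of freedom at the 5% level (two-tailed) is the standard tabulated figure rather than a reading from Fig. 27.

```python
import math


def student_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t for the difference of two sample means, with sample standard
    deviations converted to best estimates by dividing by (n - 1)."""
    standard_error = math.sqrt(sd_a ** 2 / (n_a - 1) + sd_b ** 2 / (n_b - 1))
    t_value = abs(mean_a - mean_b) / standard_error
    degrees_of_freedom = n_a + n_b - 2
    return t_value, degrees_of_freedom


# Driest decade (1904-1913) against wettest decade (1918-1927).
t, dof = student_t(27.1, 1.92, 10, 29.8, 3.66, 10)

# Assumption: standard two-tailed 5% critical value of t for 18 degrees of freedom.
CRITICAL_T_5_PERCENT_18_DF = 2.101
significant = t > CRITICAL_T_5_PERCENT_18_DF

print(round(t, 2), dof, significant)
```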

The application of such a test is not merely a nuisance imposed by 
statistical requirements. It can rather be a positive help in focusing 
attention on those differences which really are statistically significant 
and in avoiding the tendency to explain smaller differences which are 
quite likely to be solely chance occurrences. Thus if the five-year
running means were to be considered, the difference apparent in Fig. 40
between 1913-1917 (average value 26.82") and 1923-1927
(average value 31.11") would at first sight appear to be an important
one. By applying Student’s t Test, and having calculated that the
respective best estimates of the standard deviations from the two samples
were 2.22" and 2.80", it can be found that t = 1.684 with 8 degrees of
freedom. From Fig. 27 this is shown to represent a difference between
sample means that could have occurred by chance with a probability
of greater than 10%. Thus in this case the apparent fluctuation
involving a change in five-year means of the order of 4.37" cannot be
accepted as statistically valid, and further evidence must be sought 
before such a fluctuation should be accepted as a reasonable possi- 
bility. 


Cumulative Deviations from the Mean 


One difficulty with using running means is that even if a statistically 
significant change were to be established, it would not be possible to 
indicate exactly when such a change became effective. This renders 
comparisons rather difficult, while any attempt at assessing causal 
relationships from such graphs is equally hindered. Such difficulties 
are largely overcome if a different sort of graph is used instead. This 
graph is designed to show cumulative deviations from the mean, either 
in absolute or percentage terms. Only simple calculations are required 
for this. First the difference between each occurrence and the mean 
value is obtained, and these values are tabulated. The points on the 
graph are then calculated by progressively summing these differences, 
i.e. the first point is the difference between the first value and the
mean; the second point is the sum of this difference and the difference 
between the second value and the mean; and so on to the end of the 
record. This is perhaps more clearly seen from Table XXVII using 
the petroleum data for the U.S.A. given in Table XXVI. 


Table XXVII

Calculation of values for graphs of cumulative (percentual) deviations
from the mean

             Difference         Cumulative
Values       from mean          difference       % difference
             (x - x̄)                            Σ(x - x̄) x 100%
 (x)      (when x̄ = 247.4)      Σ(x - x̄)               x̄
 173          -74.4              -74.4             -30.05
 164          -83.4             -157.8             -63.7
 171          -76.4             -234.2             -94.8
 183          -64.4             -298.6            -120.8
 189          -58.4             -357.0            -144.4
 etc.         etc.               etc.              etc.
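
The progressive summation can be sketched in Python as below. The function name is hypothetical, and the mean of 247.4 is simply passed in as quoted in Table XXVII rather than recalculated; only the first five petroleum values are used. The percentage column obtained in this way agrees with the tabulated one to within a few tenths of one per cent.

```python
def cumulative_deviations(values, mean):
    """Running sum of (value - mean), in absolute terms and as % of the mean."""
    running_total = 0.0
    cumulative = []
    for value in values:
        running_total += value - mean
        cumulative.append(running_total)
    percentages = [100.0 * c / mean for c in cumulative]
    return cumulative, percentages


# First five years of U.S.A. crude petroleum production (Table XXVI).
first_five_years = [173, 164, 171, 183, 189]

cumulative, percentages = cumulative_deviations(first_five_years, 247.4)
print([round(c, 1) for c in cumulative])
print([round(p, 1) for p in percentages])
```

When the mean used is that of the whole record, the final cumulative value necessarily returns to zero, which provides a convenient check on the arithmetic.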


Curves based on such calculations are presented in Fig. 41 and 
Fig. 42 for these petroleum data and for the Bidston rainfall data. 
In the case of the former, values are expressed as percentages of the 
mean, while in the latter they are shown as absolute values in inches. 
From the petroleum graph (Fig. 41) it can be seen that a series of
lower-than-average years was followed, from 1947 onwards, by a
series of above-average years. In Fig. 42, the dominance of
drier-than-average years prior and up to 1916 is clearly seen, while the
greater frequency of occurrence of wetter-than-average years after 
this date is also clear. It must be stressed, however, that actual 
position on the graph is irrelevant when an interpretation is being 
made in terms of rate and direction of change. The significant fea- 
tures are the direction and angle of slope of the graph. Whenever this 
rises it indicates an increase in values (even if this occurs where the 
graph reads —200%), while the steeper it rises the more rapid and 
marked that increase happens to be. Equally, however, if the rate at 
which the line falls gets less, then this indicates an increase in values 
even though such increased values are still below the mean itself.
Clearly the date at which a series of below-average conditions is
replaced by a series of above-average conditions can be readily
appreciated. On the other hand, a certain amount of practice is
required for the ready interpretation of the graph in Fig. 41. This
indicates a virtually continuous rise in values by the steadily decreasing
rate at which the line falls and then its conversion to a rising line.

Figure 41. Graph of cumulative percentual deviations from the mean for
U.S.A. crude petroleum production, 1937-1956

Figure 42. Graph of cumulative deviations from the mean for Bidston
annual rainfall, 1901-1930

Whichever of these methods is used the reason for using it is to
represent the changes that have taken place with time. This may be 
desired simply to specify conditions at that one place or for that one 
commodity. At other times the purpose may include a comparison 
with the changes that have occurred elsewhere or in some other pro- 
duct. In neither case, however, can or should these methods be used 
to project beyond the actual period of the data. They are indicators 
of the past, not harbingers of the future. If such assessment is
considered desirable, then the study of these detailed changes is best
replaced by the regression lines which were outlined in Chapter 12. 
These regression lines are, in effect, trend lines which generalize the 
overall changes that have taken place. Even in these cases, however, 
great care should be taken to ensure that the factors that have caused 
this trend are likely to continue in the future, or—if the attempt is 
made to project back to the past—that they applied there too. Thus 
in the case of annual rainfall at Bidston, the facile assumption that 
the trend over 1901-1930 has always applied and will continue into 
the future would mean that in the twenty-first century the expected 
annual rainfall there would be over 40”, while at the beginning of the 
eighteenth century there would have been no rain at all! An absurdity 
such as this is only too apparent, but in other cases care must be 
taken to ensure that similar false reasoning is not applied. In the 
study of population, for example (see pp. 217-221 and Fig. 43), in- 
numerable factors including health, nutrition, migration and changing 
social customs are likely to confound any forecast of future popula- 
tions based solely on a projection into the future of the population 
regression line from the past. 


Deviation from a Trend Line 


The construction of regression lines to represent past trends can 
be of value in geography in another way. Being concerned with the 
variability of sets of data, the geographer is presented with a problem 
when the set of data itself includes a distinct trend throughout the 
period. In such cases, the calculation of variance and standard dev- 
iation values can be somewhat misleading, for they will be com- 
pounded of two elements, (i) the overall trend from the beginning to 
the end of the period and (ii) the variability of conditions from one 
occurrence to the next, which clearly occurs when the actual values 
do not perfectly fit the trend line. Thus in terms of the data on U.S.A. 
petroleum production, the overall trend reflects a steady increase 
throughout the period 1937-1956, but actual values nevertheless 
varied in relation to this trend. The calculation of variance values by 
the normal method for these 20 years may be legitimate as a statistical 
device by which to summarize the characteristics of conditions over 
those particular 20 years. It should not be assumed, however, that it 
also fairly represents the longer series of data from which those 20
years were drawn. Such a variance is not the result of values varying 
at random about the mean value, but rather it is a statistical ab- 
straction which gives an inadequate picture of a set of data in which 
a consistent trend is occurring. 

The same is true in terms of population values. Given a series of 
population data which consists of census returns at—say—ten-year 
intervals, it would be possible to calculate in the normal way the 
variance of these values in relation to the average of the body of data. 
However, because of the tendency for population to increase from 
one decade to another, it would also be possible to calculate the 
variance of the actual conditions in relation to those represented by 
an overall trend line. Such a value would reflect the combined influ- 
ences of all those factors other than that of natural growth which the 
trend line (assuming that the exponential curve is used) would itself 
define. The variance or standard deviation, obtained in the normal 
way, would be dominated by the factor of natural growth, and these 
other factors—which may well be the important factors differentiat- 
ing one area from another—would be largely obscured. 

An example in terms of population data will help to clarify this 
approach, and illustrate the type of problem that is amenable to it. 
The following set of values could well represent the population of a 
small rural parish at ten-year intervals over a period of 70 years. By 
normal methods it can be calculated that the mean of these values is 
512.5, that the best estimate of the standard deviation is 89 and that
the coefficient of variation is 17.4%.

Suggested population values for a rural parish

              Decadal
Decade        returns
                (x)
   1            390
   2            435
   3            475
   4            480
   5            500
   6            550
   7            620
   8            650
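
These three summary values can be confirmed with a few lines of Python. The sketch below is illustrative only; it uses the divisor n - 1 to obtain the best estimate of the standard deviation.

```python
import math

population = [390, 435, 475, 480, 500, 550, 620, 650]

n = len(population)
mean = sum(population) / n

# Best estimate of the variance: sum of squared deviations over (n - 1).
variance = sum((x - mean) ** 2 for x in population) / (n - 1)
std_dev = math.sqrt(variance)

# Coefficient of variation: standard deviation as a percentage of the mean.
coefficient_of_variation = 100.0 * std_dev / mean

print(mean, round(std_dev), round(coefficient_of_variation, 1))
```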


Clearly, however, there is a trend throughout this period which dis- 
plays a continuous though variable increase, so that these deviation 
and variation values reflect not only fluctuations but also this tendency 
for continued growth. To separate these two elements it is desirable
to construct a regression line that expresses this rate of growth and 
then to calculate the degree of fluctuation that occurs around this 
line. The form of the regression line will reflect the basic hypothesis 
concerning population growth. If it were to be the exponential curve, 
then the assumption would be that the major element of growth had 
been natural increase. If it were to be a straight line, then the assumed 
relationship would include some other basic factor (e.g. migration) 
acting concurrently with natural growth. These assumptions would 
necessarily affect the ultimate interpretation of any values of devia- 
tions from these trends that might be defined. In the present case 
either of these two hypotheses could be put forward, but for purposes 
of this example the law of exponential growth will be assumed. 

The first requirement is therefore to construct the appropriate 
regression line, the necessary calculations for which are set out in 
Table XXVIII following the procedure already outlined on pp. 206- 
208. From these it will be seen that the equation for the regression 
line is 
log b = 0.03a + 2.569 
For drawing on semi-logarithmic graph paper only two sets of values 
would be required from this, but as they are needed for later calcula- 
tions the hypothetical values for each of the eight points are pre- 
sented. 


Table XXVIII 


Calculation of the exponential curve for population data 


Decade   Population   log of population   (a - 4)   (log b - 2.7) 
 (a)        (b)           (log b)           (q)          (t)          (qt)     (q²) 
  1         390           2.5911            -3        -0.1089       +0.3267     9 
  2         435           2.6385            -2        -0.0615       +0.1230     4 
  3         475           2.6767            -1        -0.0233       +0.0233     1 
  4         480           2.6812             0        -0.0188        0          0 
  5         500           2.6990            +1        -0.0010       -0.0010     1 
  6         550           2.7404            +2        +0.0404       +0.0808     4 
  7         620           2.7924            +3        +0.0924       +0.2772     9 
  8         650           2.8129            +4        +0.1129       +0.4516    16 
                                            +4        +0.0322       +1.2816    44 
                                            Σq          Σt            Σqt      Σq² 

Regression coefficient: 

\[
\log y = \frac{\Sigma qt - \dfrac{\Sigma q \cdot \Sigma t}{n}}{\Sigma q^{2} - \dfrac{(\Sigma q)^{2}}{n}}
       = \frac{1.2816 - \dfrac{4 \times 0.0322}{8}}{44 - \dfrac{(4)^{2}}{8}}
       = \frac{1.2655}{42} = 0.03
\]

Averages: 

\[
\bar{a} = \frac{\Sigma q}{n} + 4 = \frac{4}{8} + 4 = 4.5, \qquad
\overline{\log b} = \frac{\Sigma t}{n} + 2.7 = \frac{0.0322}{8} + 2.7 = 2.704
\]

Regression formula: 

\[
\log b - \overline{\log b} = \log y\,(a - \bar{a})
\]

\[
\log b = \log y\,(a - \bar{a}) + \overline{\log b} = 0.03(a - 4.5) + 2.704
\]

\[
\log b = 0.03a + 2.569
\]
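The working of Table XXVIII can be reproduced in a few lines. The following is a minimal sketch in Python (an assumption of this illustration); the function name exponential_regression and its parameter names are hypothetical. It follows the same coded-value procedure, with q = a - 4 and t = log b - 2.7, but carries full precision instead of four-figure logarithms.

```python
from math import log10

decades = [1, 2, 3, 4, 5, 6, 7, 8]
population = [390, 435, 475, 480, 500, 550, 620, 650]


def exponential_regression(a_values, b_values, a_origin=4, log_origin=2.7):
    """Fit log10(b) = slope * a + intercept using coded values q and t."""
    n = len(a_values)
    q = [a - a_origin for a in a_values]          # coded time values
    t = [log10(b) - log_origin for b in b_values]  # coded log values
    sum_q, sum_t = sum(q), sum(t)
    sum_qt = sum(qi * ti for qi, ti in zip(q, t))
    sum_qq = sum(qi * qi for qi in q)
    # Regression coefficient (log y in the notation of the text).
    slope = (sum_qt - sum_q * sum_t / n) / (sum_qq - sum_q ** 2 / n)
    # Averages, converted back from the coded origins.
    mean_a = sum_q / n + a_origin
    mean_log_b = sum_t / n + log_origin
    intercept = mean_log_b - slope * mean_a
    return slope, intercept


slope, intercept = exponential_regression(decades, population)
print(f"log b = {slope:.4f}a + {intercept:.3f}")
```

Run at full precision, the coefficient comes out at about 0.0301 and the constant at about 2.568. The text rounds the coefficient to 0.03 before working out the constant, which is why it quotes 2.569.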


Table XXIX 


Calculation of the coefficient of variation of actual decadal values 
from hypothetical decadal values based on the exponential regression 
line 


  i       ii            iii              iv           v              vi            vii 
 (a)   (log b)   (hypothetical b)   (actual b)   (iv - iii)   (v as % of iii)    (vi)² 
  1     2.599         397.2            390          -7.2                          3.28 
  2     2.629         425.6            435          +9.4                          4.88 
  3     2.659         456.0            475         +19.0                         17.40 
  4     2.689         488.7            480          -8.7                          3.17 
  5     2.719         523.6            500         -23.6                         20.25 
  6     2.749         561.0            550         -11.0                          3.85 
  7     2.779         601.2            620         +18.8                          9.80 
  8     2.809         644.2            650          +5.8                          0.85 
                                                                         Sum     63.48 
                                                                 63.48 ÷ 7 =      9.07 


Coefficient of variation, or percentage standard deviation = √9.07 = 3.01% 

Figure 43. Deviation of observed population data from an exponential regression 
line (based on A. Geddes, Geographical Review, 32 (1942) and 44 (1954)) 


From the actual and hypothetical values for the population set out 
in Table XXIX, calculation proceeds very much along the lines of that 
for the χ² test. Thus the differences between the observed (or actual) 
values and the expected (or hypothetical) values are first obtained 
(see also Fig. 43). It is then possible to work with these as percentages 
of the expected value, and these differences (column v in Table XXIX) 
have been converted to percentages of their respective expected 
values in column vi of the same table. This is necessary because the 
variability is being measured from the trend line, not from one mean 
value as in the case of the normal standard deviation. These 
percentage deviations from the trend are then squared, the values 
summed and divided by (n - 1) rather than by (n), because of the 
small size of the sample. This gives, in percentage terms, the best 
estimate of the variance of the population values from the trend line. 
This value is here 9.07 and the square root of this, i.e. 3.01%, 
gives the best estimate of the percentage standard deviation (or the 
coefficient of variation) of these population data about the 
exponential regression line that represents the trend. This can be 
compared to the value of 17.4% given on p. 217, reflecting this 
variability plus the overall trend. This small value is the result of 
those factors that are not incorporated in the trend line. Such values 
as these allow the comparison between different units to be made, in 
terms of the extent to which population changes in those units deviate 
from the hypothetical changes (here based on the exponential curve). 
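The procedure just described can be set out as a short calculation. This is a sketch only, in Python (an assumption of this illustration), and the names trend and percentage_sd_from_trend are hypothetical. It takes the regression line log b = 0.03a + 2.569 from the text as given, expresses each deviation as a percentage of its own trend value, and divides the sum of squares by (n - 1).

```python
from math import sqrt

decades = [1, 2, 3, 4, 5, 6, 7, 8]
population = [390, 435, 475, 480, 500, 550, 620, 650]


def trend(a, slope=0.03, intercept=2.569):
    """Hypothetical population from the exponential line log b = 0.03a + 2.569."""
    return 10 ** (slope * a + intercept)


def percentage_sd_from_trend(a_values, b_values, trend_fn):
    """Best estimate of the percentage standard deviation about a trend line."""
    # Each deviation is a percentage of its OWN expected (trend) value.
    pct = [100 * (b - trend_fn(a)) / trend_fn(a)
           for a, b in zip(a_values, b_values)]
    variance = sum(p * p for p in pct) / (len(pct) - 1)
    return sqrt(variance)


result = percentage_sd_from_trend(decades, population, trend)
print(f"percentage standard deviation about the trend = {result:.2f}%")
```

The result is close to the 3.01% obtained in Table XXIX. Small differences in the second decimal place can arise because the table works from rounded hypothetical values.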

The calculation of this deviation value involves considerable 
labour, which can be decreased to some extent provided that approxi- 
mation is permitted. Thus it can be seen from Fig. 43 that in the 
present example the exponential curve differs but slightly from a 
straight line. Therefore, having obtained the differences between the 
actual values and the hypothetical ones (column v in Table XXIX) 
it is possible to obtain the standard deviation from these in the usual 
way, and then express this as a percentage of the overall average 
obtained on the assumption that the curve is a straight line. In this 
way, while the individual deviations are measured from the right 
place, they are expressed as a percentage of the one value rather than 
of a different one each. Moreover, this one value is not the true mean, 
but is obtained by halving the sum of the lowest and highest values 
(this is the true mean only if the curve is a straight line). In the present 
example these several approximations almost balance each other out. 
Thus the standard deviation of the values from the exponential curve 
is 15.3 while the average of the values, if it is assumed that they fall 
on a straight line, is (397.2 + 644.2)/2 = 520.7. The percentage value 
therefore is 2.94, only a little less than that obtained by the longer 
method. The difference between the precise method and this more 
approximate one is usually small, provided that the rate of increase 
of the exponential curve is not very great. 
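The shorter method can be sketched in the same way. The code below is an illustration in Python (an assumption, as are the names trend and approximate_percentage_sd). It takes the standard deviation of the absolute differences from the trend and expresses it as a percentage of the single value obtained by halving the sum of the first and last trend values.

```python
from math import sqrt

decades = [1, 2, 3, 4, 5, 6, 7, 8]
population = [390, 435, 475, 480, 500, 550, 620, 650]


def trend(a, slope=0.03, intercept=2.569):
    """Hypothetical population from the exponential line log b = 0.03a + 2.569."""
    return 10 ** (slope * a + intercept)


def approximate_percentage_sd(a_values, b_values, trend_fn):
    """Approximate percentage standard deviation, treating the curve as a straight line."""
    # Deviations are still measured from the trend, but in absolute units.
    diffs = [b - trend_fn(a) for a, b in zip(a_values, b_values)]
    sd = sqrt(sum(d * d for d in diffs) / (len(diffs) - 1))
    # Halving the sum of the lowest and highest trend values gives the
    # true mean only if the trend really is a straight line.
    midpoint = (trend_fn(a_values[0]) + trend_fn(a_values[-1])) / 2
    return sd, midpoint, 100 * sd / midpoint


sd, midpoint, pct = approximate_percentage_sd(decades, population, trend)
print(f"sd = {sd:.1f}, midpoint = {midpoint:.1f}, percentage = {pct:.2f}%")
```

The three values come out at about 15.3, 520.7 and 2.94%, in line with the figures quoted above.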

A further example, this time based on a straight-line regression, 
will emphasize the method again, and also allow of a comparison 
being made between two sets of data. Assume that, for some par- 
ticular crop, comparisons of yields over a ten-year period were made 
between two widely different areas, in which the techniques of pro- 
duction and land management also were different. Despite this, the 
average values for these two areas were found to be the same (i.e. 
20 bushels per acre) as also were the standard deviations of the two 
sets of data (i.e. 3.16 bushels or a percentage value of 15.8%). The 
yields for these two areas for each year were as set out below. 

Area I    Area II 
  16        24 
  17        23 
  24        24 
  23        20 
  20        20 
  20        23 
  23        17 
  24        17 
  17        16 
  16        16 


Clearly the year-to-year variations are markedly different, and it 
would, of course, be possible to compare these by means of a correla- 
tion coefficient. Equally, however, it would be possible to calculate 
a straight regression line for each set of data, to see whether any over- 
all trend occurred. By applying the methods outlined in Table XXIII, 
the regression formulae would be: 

Area I     b = 20                 where a = the time-scale 
Area II    b = 25.4 - 0.982a      and b = the crop yield 
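Both the identical summary values and the two regression lines can be verified directly. The sketch below is in Python (an assumption of this illustration), with hypothetical function names, and takes the years as a = 1 to 10. Note that the standard deviation of 3.16 quoted in the text corresponds to division by n; that choice is made here only to reproduce the quoted figure.

```python
from math import sqrt

area_1 = [16, 17, 24, 23, 20, 20, 23, 24, 17, 16]
area_2 = [24, 23, 24, 20, 20, 23, 17, 17, 16, 16]


def mean_and_sd(values):
    """Mean and standard deviation (dividing by n, as the 3.16 in the text implies)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, sd


def straight_line_regression(values):
    """Least-squares line b = intercept + slope * a, with a = 1, 2, ..., n."""
    n = len(values)
    years = range(1, n + 1)
    mean_a = sum(years) / n
    mean_b = sum(values) / n
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(years, values))
    s_aa = sum((a - mean_a) ** 2 for a in years)
    slope = s_ab / s_aa
    intercept = mean_b - slope * mean_a
    return slope, intercept


mean_1, sd_1 = mean_and_sd(area_1)
mean_2, sd_2 = mean_and_sd(area_2)
slope_1, intercept_1 = straight_line_regression(area_1)
slope_2, intercept_2 = straight_line_regression(area_2)
print(f"Area I:  mean {mean_1:.1f}, sd {sd_1:.2f}, b = {intercept_1:.1f} + ({slope_1:.3f})a")
print(f"Area II: mean {mean_2:.1f}, sd {sd_2:.2f}, b = {intercept_2:.1f} + ({slope_2:.3f})a")
```

Both areas give a mean of 20 and a standard deviation of about 3.16, yet the slope for Area I is zero while that for Area II is about -0.982, which is the contrast the text goes on to discuss.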

Thus in Area I there is no overall trend at all (see Fig. 44a), so that 
the percentage variability value of 15.8% reflects variability about 
one constant value, i.e. it illustrates the influence of such factors as 
annual variations in climate or seed quality etc. The actual cause 
cannot, of course, be obtained without further analysis of possible 
causative factors. In Area II, however, a marked overall trend can be 
seen (Fig. 44b) which consists of a fall in yields as time passes. Thus 
there is some dominant factor at work leading to decreasing yields (e.g. 
decreasing soil fertility because of agricultural practices; a 
progressive deterioration in climate), and this factor is hidden in the 
overall variability of 15.8%. In such a case it is useful to be able to 
separate the variability that results from factors other than those that 
induce declining yields, and this can be done by the present technique 
of assessing variability from the trend. 

Figure 44. Straight-line regressions for crop yields for two areas for 
the same ten years 


The necessary calculations for this are set out below. These consist 
of first obtaining for each year the values from the trend, either from 
the graph in Fig. 44b or by calculating from the regression formula 
given above. Then the difference between each observed or actual 
value and these expected trend values is obtained, and expressed as 
a percentage of the appropriate trend value. These percentages are 
squared and summed, this value being divided by (n - 1) to give the 
best estimate of the percentage variance, and finally the square root 
obtained to yield the best estimate of the percentage standard 
deviation. This is now seen to be 7.6%, so that the variability due to 
factors other than those producing the overall decline is markedly 
smaller than in the first case. 
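The steps just listed for Area II can be sketched as follows. The code is in Python (an assumption of this illustration) and the names trend and percentage_sd_from_trend are hypothetical. The trend line b = 25.4 - 0.982a is taken as given, with the years numbered a = 1 to 10.

```python
from math import sqrt

area_2 = [24, 23, 24, 20, 20, 23, 17, 17, 16, 16]


def trend(year):
    """Expected yield for Area II from the straight-line regression b = 25.4 - 0.982a."""
    return 25.4 - 0.982 * year


def percentage_sd_from_trend(values, trend_fn):
    """Best estimate of the percentage standard deviation about a trend line."""
    # Difference between actual and trend value, as a percentage of the trend value.
    pct = [100 * (actual - trend_fn(year)) / trend_fn(year)
           for year, actual in enumerate(values, start=1)]
    variance = sum(p * p for p in pct) / (len(pct) - 1)
    return sqrt(variance)


result = percentage_sd_from_trend(area_2, trend)
print(f"percentage standard deviation about the trend = {result:.1f}%")
```

This gives a value of about 7.6%, roughly half the 15.8% obtained when the decline in yields is not separated out.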


        Yield      Yield                    Difference 
Year   (trend)    (actual)    Difference        %          %² 
  1    24.418       24          0.418          1.7         2.89 
  2    23.436       23          0.436          1.9         3.61 
  3    22.454       24          1.546          6.9        47.61 
  4    21.472       20          1.472          6.9        47.61 
  5    20.490       20          0.490          2.4         5.76 
  6    19.508       23          3.492         17.9       320.41 
  7    18.526       17          1.526          8.2        67.24 
  8    17.544       17          0.544          3.1         9.61 
  9    16.562       16          0.562          3.4        11.56 
 10    15.580       16          0.420          2.7         7.29 
                                              Sum        523.59 

% variance = 523.59 ÷ 9 = 58.18 
% standard deviation = √58.18 = 7.6% 


Rhythmic Fluctuations 


In the graph of crop yields for Area I (Fig. 44a) it can be seen that 
there appears to be some semi-rhythmic fluctuation in the values, this 
rhythm being so regular that it can be smoothed out into a trend line 
that displays no overall change. Fluctuations that follow, regularly 
or irregularly, a semi-rhythmic pattern require yet more advanced 
techniques for their definition and elucidation. Many phenomena, 
whose data cannot be represented by one trend (whether straight-line 
or exponential), may nevertheless correspond to the overlapping of 
several dissimilar rhythms or waves. To define these requires some 
ability in harmonic analysis, a technique that must remain beyond 
the scope of an introductory book such as this. The reader is referred 
to more advanced statistical texts if some proficiency in harmonic 
analysis is required for research purposes. If, in a long series of 
data, there are clearly several distinct trends, it is, however, always 
possible to compute the regression for each of these periods separ- 
ately. This will provide a closer approximation to the trend in such 
cases than will the reliance on one simple regression line that groups 
several smaller but distinctive trends together. Thus, although this 
must remain a ‘second best’ as compared to harmonic analysis, being 
both generalized and in part subjective, it does provide a simple 
method of making a first approximation to a series of regression lines 
that will show changes in trends from one period to another. 


CHAPTER 14 


SCOPE FOR THE FUTURE 


Throughout all the preceding chapters the conscious aim has been to 
present, as simply as possible, the basic elements of a wide range of 
statistical techniques. All of these techniques are standard ones and 
have been widely applied in many fields of study, where they form 
an essential tool in the analysis of numerical data. Without such tech- 
niques these fields of study would not have progressed as steadily and 
effectively as they have done. They have allowed the conclusions of 
experimental or observational studies to be presented in a form that 
is common to all fields that attempt to express their results quantita- 
tively. Furthermore, the use of these methods helps to reduce the ele- 
ment of subjective judgment in so many ways, thus ensuring that 
from the same set of data different workers will arrive at roughly the 
same conclusion. In this way it is possible for studies to be repeated 
so that cross-checking of results can be effected, while it also means 
that the mental reasoning by which a certain conclusion is arrived at 
is clearly apparent to all later workers. The gains thus include greater 
clarity, objectivity, orderliness and precision. 

This is not to argue, however, that only conclusions based on 
statistical methods are of any validity. While some problems lend 
themselves to analysis by such methods, being concerned with quan- 
titative data of one sort or another, others can only be resolved by 
personal assessment based on experience, ability and the proper 
understanding of the phenomena under study. Even in these cases, 
however, it is often true that such personal assessments can be con- 
siderably assisted and facilitated by the use of statistical analysis at 
one or more stages in the study. Equally, experience, ability and 
understanding are essential before any study based on statistical 
methods can be expected to yield valuable and relevant results. In 
other words, statistical techniques are simply a series of special tools 
which can be of as much assistance in the study of geographical 
problems as they have proved to be in problems of the pure sciences, 
other field sciences and the social sciences. This does not mean that 
they will be of equal value in all problems that confront geographers, 
any more than palaeography, pollen analysis or surveying are always 
relevant to any particular problem. It does mean, however, that 
whenever statistical analysis is relevant and is required, then the 
geographer should use such techniques to the fullest extent that is 
necessary to ensure a satisfactory solution of his problem. 

The use of such techniques necessarily implies a proper understand- 
ing of them. Only in this way can a sound choice be made between 
differing methods, the data be organized in a suitable form, and the 
correct interpretation be made of the results. This is just as important 
if, as often in more complex problems, the geographer must seek 
guidance from a professional statistician, otherwise the most refined 
techniques may lead to erroneous conclusions through mis-interpreta- 
tion. For such an understanding the simple concepts and techniques 
presented in this book are essential. In many studies these simpler 
techniques will be all that is required, but even if more advanced and 
complex techniques are needed for particular problems most of them 
will be found to be related to these simple concepts. 

The major practical problem in applying these or other techniques 
is likely to be related to the time consumed in making the necessary 
calculations. While it is true that practice greatly increases speed (and 
one hopes accuracy, too), some mechanical means of assistance is 
invaluable. Facility with a slide rule is almost a sine qua non if numer- 
ous calculations are being made, and with this the individual student 
can cope with quite substantial calculations in a reasonable space of 
time. For larger problems, however, especially when the body of data 
is considerable, a mechanical calculating machine is almost indis- 
pensable. Desk models, operated either manually or electrically, can 
allow of great quantities of data being processed with perfect accuracy 
and relatively little strain. In view of the wide range of geographical 
problems that can be approached, at least in part, via statistical 
analysis, it would seem more useful for geography students to be 
proficient with a calculating machine than with a theodolite or 
meteorological instruments, with their more limited application! At 
research level, of course, it is now possible to employ electronic com- 
puters to effect lengthy and involved calculations exceedingly rapidly, 
although the time taken up in the initial card punching and program- 
ming should always be borne in mind. Nevertheless the existence of 
such computers in most universities, as well as in many private estab- 
lishments, now opens up the possibility of tackling fairly quickly 
problems of a magnitude that formerly could not have been contemplated. 
The large-scale study need no longer be either excessively 
generalized or else a lifetime’s task, but rather it may be a major 
project lasting a period of two to three years. The use of such com- 
puters necessarily requires training and practice, but with assistance 
from those directly concerned with them this is a feasible proposi- 
tion. It does mean, however, that the simple techniques of statistical 
analysis must be known and understood not only by those carrying 
out such studies, but also by all geographers who are going to use 
or interpret the results obtained in this way. 

Finally it must be stressed that facility with any of these techniques, 
whether they be simple or complex ones, will only come by continued 
use and practice. This is especially true for those geographers—and 
they are the majority, unfortunately—who have used mathematical 
methods for little more than everyday purposes since the age of 
sixteen. Once a certain familiarity has been established with these 
methods, however, the possible uses of them become increasingly 
apparent. The problems presented in this book represent but a very 
small proportion of the type of problem that could have been con- 
sidered, and gradually these techniques will be expanded into all 
those aspects of geography where they have any relevance. Provided 
that these methods of analysis are then kept in their proper place, 
i.e. as a tool by which geographical studies can be furthered, and not 
as an end in themselves, they can provide a positive contribution to 
the expansion and value of geography as a whole. 


The following books represent a brief cross-section of the large 
literature on statistical methods and their application. All are con- 
cerned with basic methods, some of which have been applied in 
geographical studies but many of which have not. Those few books 
that have been listed here provide texts in English to suit virtually all 
levels of analysis that geographers are likely to require. By reference 
to these it will be possible both to expand the examples of the simple 
methods that have been outlined in this book, and also to consider 
numerous more advanced techniques, some of which have been referred 
to in passing in the previous pages. 